Michal Turek created PARQUET-408:
------------------------------------

             Summary: Shutdown hook in parquet-avro library corrupts data and disables logging
                 Key: PARQUET-408
                 URL: https://issues.apache.org/jira/browse/PARQUET-408
             Project: Parquet
          Issue Type: Bug
          Components: parquet-avro
    Affects Versions: 1.8.1
            Reporter: Michal Turek
             Fix For: 1.9.0
         Attachments: parquet-broken-shutdown_2015-12-16.tar.gz

Parquet-avro, and probably other Parquet libraries as well, is not well behaved. It registers a shutdown hook that bypasses the application's shutdown sequence, corrupts data written to currently opened Parquet file(s), and disables or reconfigures the slf4j/logback logger so that no further log messages are visible.


h3. Scope

Our application is a microservice that handles a stop request in the form of the SIGTERM signal, i.e. a JVM shutdown hook. When it arrives, the application closes all opened files (writers), releases all other resources and shuts down gracefully. We are currently switching from sequence files to Parquet and are using the Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}}, which is the current latest version. We use {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM.
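
For illustration, a minimal sketch of this pattern, assuming a trivial Avro schema and a hard-coded output path (this is not the attached reproducer; class and field names are made up):

{noformat}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class GracefulShutdownSketch {
    private volatile boolean stopRequested = false;

    public static void main(String[] args) throws Exception {
        GracefulShutdownSketch app = new GracefulShutdownSketch();
        Thread mainThread = Thread.currentThread();
        // Application-level hook: request a stop and wait until the main thread
        // has closed the writer, because the JVM halts once all hooks return.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            app.stopRequested = true;
            try {
                mainThread.join();
            } catch (InterruptedException ignored) {
            }
        }));
        app.run();
    }

    private void run() throws Exception {
        Schema schema = SchemaBuilder.record("Example").fields()
                .requiredString("value").endRecord();
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/example.parquet"))
                .withSchema(schema)
                .build()) {
            while (!stopRequested) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("value", "payload");
                writer.write(record);
                Thread.sleep(100); // keep the example gentle
            }
        } // close() flushes the buffered row group and writes the Parquet footer
    }
}
{noformat}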


h3. Example code

See the archive in the attachment.

- Optionally update the version of {{hadoop-client}} in {{pom.xml}} to match your Hadoop.
- Use {{mvn package}} to compile.
- Copy the Hadoop configuration XMLs to the {{config}} directory.
- Update the configuration at the top of the {{ParquetBrokenShutdown}} class.
- Execute the {{ParquetBrokenShutdown}} class.
- Send SIGTERM to shut down the application ({{kill PID}}).


h3. Initial analysis

The Parquet library tries to take care of application shutdown itself, but this introduces more issues than it solves. If the application is writing to a file and the library asynchronously decides to close the underlying writer, data loss occurs: the handle is simply closed and the remaining records cannot be written.

{noformat}
Writing to HDFS/Parquet failed
java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, uncompressed_page_size:14, compressed_page_size:34, dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN))
        at org.apache.parquet.format.Util.write(Util.java:224)
        at org.apache.parquet.format.Util.writePageHeader(Util.java:61)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760)
        at org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165)
        at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
        at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297)
        at com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86)
        at com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53)
        at com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: parquet.org.apache.thrift.transport.TTransportException: java.nio.channels.ClosedChannelException
        at parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
        at parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105)
        at parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424)
        at parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431)
        at parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:194)
        at parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBegin(TCompactProtocol.java:176)
        at org.apache.parquet.format.InterningProtocol.writeFieldBegin(InterningProtocol.java:74)
        at org.apache.parquet.format.PageHeader.write(PageHeader.java:918)
        at org.apache.parquet.format.Util.write(Util.java:222)
        ... 16 more
Caused by: java.nio.channels.ClosedChannelException
        at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:1635)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:145)
        ... 24 more
{noformat}

The stored file is also corrupted in such a case. It contains only the four-byte "PAR1" header, so effectively all data is lost.
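
For completeness, a quick way to check whether the produced file contains anything beyond the 4-byte magic (a sketch; the path handling is an assumption):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckTruncatedParquet {
    public static void main(String[] args) throws Exception {
        Path file = new Path(args[0]);
        FileSystem fs = file.getFileSystem(new Configuration());
        long length = fs.getFileStatus(file).getLen();
        // A valid Parquet file has the "PAR1" magic at both ends plus a footer,
        // so a file of 4 bytes or less cannot contain any data at all.
        if (length <= 4) {
            System.out.println(file + ": only the magic header remains, all data lost");
        } else {
            System.out.println(file + ": " + length + " bytes");
        }
    }
}
{noformat}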

Together with that, the application's whole slf4j/logback logging is reconfigured or completely disabled, so no further log messages from the application appear in the output. This may be related to PARQUET-369 and will hopefully be fixed by PARQUET-401. No such issue was observed when sequence files were used in the past.


h3. Known workaround

Java executes shutdown hook threads independently and in parallel, but the application can force its own shutdown hook to run before Parquet's. The Parquet library registers its hook through Hadoop's {{ShutdownHookManager}}, which allows a priority to be defined.

{noformat}
if (!USE_WORKAROUND) {
    Runtime.getRuntime().addShutdownHook(shutdownHook);
} else {
    int maxPriority = Integer.MAX_VALUE; // Execute the application hook before the hook in the Hadoop library
    ShutdownHookManager.get().addShutdownHook(shutdownHook, maxPriority);
}
{noformat}

This workaround would be impossible if there were two or more such nasty libraries.
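
For reference, a self-contained sketch of the same workaround (the priority value matches the snippet above; the class name and the hook body are illustrative assumptions):

{noformat}
import org.apache.hadoop.util.ShutdownHookManager;

public class WorkaroundSketch {
    public static void main(String[] args) {
        // The application's own cleanup: close writers and release other resources.
        Runnable shutdownHook = () -> {
            // writer.close(); etc.
        };

        // The highest priority makes Hadoop's ShutdownHookManager run this hook
        // before the hook registered by the Parquet/Hadoop code, so the application
        // can still flush and close its ParquetWriter instances.
        int maxPriority = Integer.MAX_VALUE;
        ShutdownHookManager.get().addShutdownHook(shutdownHook, maxPriority);
    }
}
{noformat}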

So far, no workaround has been found for the second issue with the disabled logging.


h3. Suggestions for fix

- Never touch the logging configuration in libraries.
-- Use the slf4j-api interfaces and just output log messages.
-- The application is responsible for configuring logging.
- Do not register shutdown hooks in libraries that release resources which may still be in use.
-- Instead, provide an API that lets the application close/free the allocated resources (already present).
-- The application is responsible for calling close on all Closeables during shutdown, in the correct order and at the correct time; see the sketch after this list.
-- No library has enough information for that.
- Be a well-behaved library; do not try to be smarter than the application, it isn't possible.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
