[jira] [Commented] (PARQUET-531) Can't read past first page in a column
[ https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149593#comment-15149593 ] Deepak Majeti commented on PARQUET-531: --- [~wesmckinn] I will look into this once the PARQUET-499 patch is completed. Looks like some variables are not being re-initialized in the scanner. > Can't read past first page in a column > -- > > Key: PARQUET-531 > URL: https://issues.apache.org/jira/browse/PARQUET-531 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Environment: Ubuntu Linux 14.04 (no obvious platform dependence), > Parquet file created by Apache Spark 1.5.0 on the same platform. >Reporter: Spiro Michaylov >Assignee: Deepak Majeti > Attachments: > part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet > > > Building the code as of 2/14/2015 and adding the obvious three lines of code > to serialized-page.cc to enable the newly added CompressionCodec::GZIP: > {code} > case parquet::CompressionCodec::GZIP: >decompressor_.reset(new GZipCodec()); >break; > {code} > I try to run the parquet_reader example on the column I'm about to attach, > which was created by Apache Spark 1.5.0. It works surprisingly well until it > hits the end of the first page, where it dies with > {quote} > Parquet error: Value was non-null, but has not been buffered > {quote} > I realize you may be reluctant to look at this because (a) the GZip support > is new and (b) I had to modify the code to enable it, but actually things > seem to decompress just fine (congratulations: this is awesome!): looking at > the problem in the debugger and tracing through a bit it seems to me like the > buffering is a bit screwed up in general -- some kind of confusion between > the buffering at the Scanner and Reader levels. I can reproduce the problem > by reading through just a single column too. > It fails after 128 rows, which is suspicious given this line in > column/scanner.h: > {code} > DEFAULT_SCANNER_BATCH_SIZE = 128; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
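For readers following along without the sources at hand, the sketch below illustrates the kind of batch-refill bookkeeping being discussed. It is a minimal, self-contained C++ sketch with made-up names (ScannerSketch, FakeColumnReader), not the parquet-cpp implementation; it only shows why a scanner whose offset and buffered-count variables are not re-initialized on each 128-value refill would fail right after the first batch, matching the symptom in the report.

{code}
// Minimal sketch (not parquet-cpp code): a batched scanner must reset its
// consumed offset every time it refills, otherwise reads fail after the first
// 128 values. All names here are hypothetical.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int kDefaultScannerBatchSize = 128;

// Stands in for a ColumnReader whose column chunk spans several data pages.
class FakeColumnReader {
 public:
  explicit FakeColumnReader(int64_t total) : remaining_(total) {}
  int64_t ReadBatch(int64_t max_values, int32_t* out) {
    int64_t n = std::min<int64_t>(max_values, remaining_);
    for (int64_t i = 0; i < n; ++i) out[i] = static_cast<int32_t>(i);
    remaining_ -= n;
    return n;
  }
 private:
  int64_t remaining_;
};

class ScannerSketch {
 public:
  explicit ScannerSketch(FakeColumnReader* reader)
      : reader_(reader), batch_(kDefaultScannerBatchSize) {}

  bool NextValue(int32_t* out) {
    if (offset_ == buffered_) {
      // The step under suspicion: both offset_ and buffered_ must be
      // re-initialized on every refill, or values past the first batch
      // appear "non-null, but not buffered".
      offset_ = 0;
      buffered_ = reader_->ReadBatch(static_cast<int64_t>(batch_.size()), batch_.data());
      if (buffered_ == 0) return false;  // column chunk exhausted
    }
    *out = batch_[offset_++];
    return true;
  }

 private:
  FakeColumnReader* reader_;
  std::vector<int32_t> batch_;
  int64_t offset_ = 0;
  int64_t buffered_ = 0;
};

int main() {
  FakeColumnReader reader(300);  // more than one 128-value batch
  ScannerSketch scanner(&reader);
  int32_t v;
  int64_t count = 0;
  while (scanner.NextValue(&v)) ++count;
  std::printf("read %d values\n", static_cast<int>(count));  // expect 300
  return 0;
}
{code}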
[jira] [Commented] (PARQUET-531) Can't read past first page in a column
[ https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149557#comment-15149557 ] Wes McKinney commented on PARQUET-531: -- [~mdeepak] I just tried reading this file after PARQUET-515 and PARQUET-523 were applied and it appears the bug lies in the Scanner, so we can leave this open until we have a test case reproduction > Can't read past first page in a column > -- > > Key: PARQUET-531 > URL: https://issues.apache.org/jira/browse/PARQUET-531 > Project: Parquet > Issue Type: Bug > Components: parquet-cpp > Environment: Ubuntu Linux 14.04 (no obvious platform dependence), > Parquet file created by Apache Spark 1.5.0 on the same platform. >Reporter: Spiro Michaylov > Attachments: > part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet > > > Building the code as of 2/14/2015 and adding the obvious three lines of code > to serialized-page.cc to enable the newly added CompressionCodec::GZIP: > {code} > case parquet::CompressionCodec::GZIP: >decompressor_.reset(new GZipCodec()); >break; > {code} > I try to run the parquet_reader example on the column I'm about to attach, > which was created by Apache Spark 1.5.0. It works surprisingly well until it > hits the end of the first page, where it dies with > {quote} > Parquet error: Value was non-null, but has not been buffered > {quote} > I realize you may be reluctant to look at this because (a) the GZip support > is new and (b) I had to modify the code to enable it, but actually things > seem to decompress just fine (congratulations: this is awesome!): looking at > the problem in the debugger and tracing through a bit it seems to me like the > buffering is a bit screwed up in general -- some kind of confusion between > the buffering at the Scanner and Reader levels. I can reproduce the problem > by reading through just a single column too. > It fails after 128 rows, which is suspicious given this line in > column/scanner.h: > {code} > DEFAULT_SCANNER_BATCH_SIZE = 128; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PARQUET-514) Automate coveralls.io updates in Travis CI
[ https://issues.apache.org/jira/browse/PARQUET-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated PARQUET-514: - Description: The repo has been enabled in INFRA-11273, so all that's left is to work on the Travis CI build matrix and add coveralls to one of the builds (rather than running it for all of them) (was: This is actually fairly seamless -- if you run the codecov upload from within the Travis CI job, you do not need the codecov API token. There may be some twiddling to do to set up a build matrix that includes only a single job uploading code coverage (right now the build matrix is somewhat monolithic and hard to customize in a granular way).) Summary: Automate coveralls.io updates in Travis CI (was: Automate codecov.io updates in Travis CI) > Automate coveralls.io updates in Travis CI > -- > > Key: PARQUET-514 > URL: https://issues.apache.org/jira/browse/PARQUET-514 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Minor > > The repo has been enabled in INFRA-11273, so all that's left is to work on > the Travis CI build matrix and add coveralls to one of the builds (rather > than running it for all of them) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-523) Test ColumnReader on a column chunk containing multiple data pages
[ https://issues.apache.org/jira/browse/PARQUET-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepak Majeti resolved PARQUET-523. --- Resolution: Resolved Issue resolved by pull request 51 https://github.com/apache/parquet-cpp/pull/51 > Test ColumnReader on a column chunk containing multiple data pages > > > Key: PARQUET-523 > URL: https://issues.apache.org/jira/browse/PARQUET-523 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Deepak Majeti > > Our test cases currently only cover data containing a single data page -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder
[ https://issues.apache.org/jira/browse/PARQUET-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Le Dem resolved PARQUET-515. --- Resolution: Fixed Fix Version/s: cpp-0.1 Issue resolved by pull request 51 [https://github.com/apache/parquet-cpp/pull/51] > Add "Reset" to LevelEncoder and LevelDecoder > > > Key: PARQUET-515 > URL: https://issues.apache.org/jira/browse/PARQUET-515 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Deepak Majeti >Assignee: Deepak Majeti > Fix For: cpp-0.1 > > > The rle-encoder and rle-decoder classes have a "Reset" method as a quick way > to initialize the objects. This method resets the encoder and decoder state to > work on a new buffer without the need to create a new object for every data > page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
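As an aside for readers who have not seen the patch, the sketch below shows the reset-and-reuse pattern the description refers to, with a deliberately toy decode routine and hypothetical names; it is not the parquet-cpp LevelDecoder API.

{code}
// Toy sketch of the Reset/reuse pattern: one decoder object is re-pointed at
// each new data page's level bytes instead of being constructed per page.
#include <algorithm>
#include <cstdint>
#include <vector>

class LevelDecoderSketch {
 public:
  // Rewind all per-page state and point at the next page's buffer.
  void Reset(const uint8_t* data, int64_t size, int bit_width) {
    data_ = data;
    size_ = size;
    bit_width_ = bit_width;
    offset_ = 0;  // forgetting this is the classic stale-state bug
  }

  // Toy decode: one byte per level, just to exercise the interface.
  int64_t Decode(int64_t max_levels, int16_t* out) {
    int64_t n = std::min(max_levels, size_ - offset_);
    for (int64_t i = 0; i < n; ++i) out[i] = data_[offset_ + i];
    offset_ += n;
    return n;
  }

 private:
  const uint8_t* data_ = nullptr;
  int64_t size_ = 0;
  int64_t offset_ = 0;
  int bit_width_ = 0;
};

int main() {
  std::vector<uint8_t> page1 = {0, 1, 1, 0};
  std::vector<uint8_t> page2 = {1, 1};
  std::vector<int16_t> levels(4);

  LevelDecoderSketch decoder;
  decoder.Reset(page1.data(), static_cast<int64_t>(page1.size()), /*bit_width=*/1);
  decoder.Decode(static_cast<int64_t>(levels.size()), levels.data());

  // Same object, next data page: no new allocation, just Reset.
  decoder.Reset(page2.data(), static_cast<int64_t>(page2.size()), /*bit_width=*/1);
  decoder.Decode(static_cast<int64_t>(levels.size()), levels.data());
  return 0;
}
{code}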
[jira] [Commented] (PARQUET-514) Automate codecov.io updates in Travis CI
[ https://issues.apache.org/jira/browse/PARQUET-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149169#comment-15149169 ] Wes McKinney commented on PARQUET-514: -- I've asked ASF infra to help with this, see INFRA-11273 > Automate codecov.io updates in Travis CI > > > Key: PARQUET-514 > URL: https://issues.apache.org/jira/browse/PARQUET-514 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Minor > > This is actually fairly seamless -- if you run the codecov upload from within > the Travis CI job, you do not need the codecov API token. There may be some > twiddling to do to set up a build matrix that includes only a single job > uploading code coverage (right now the build matrix is somewhat monolithic > and hard to customize in a granular way). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (PARQUET-518) Review usages of size_t and unsigned integers generally per Google style guide
[ https://issues.apache.org/jira/browse/PARQUET-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned PARQUET-518: Assignee: Wes McKinney > Review usages of size_t and unsigned integers generally per Google style guide > -- > > Key: PARQUET-518 > URL: https://issues.apache.org/jira/browse/PARQUET-518 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Wes McKinney >Priority: Minor > > The Google style guide recommends generally avoiding unsigned integers for > the bugs they can silently introduce. > https://google.github.io/styleguide/cppguide.html#Integer_Types -- This message was sent by Atlassian JIRA (v6.3.4#6332)
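For background, the example below (a standalone illustration, not taken from the style guide or from parquet-cpp) shows the class of silent bug the recommendation targets: unsigned arithmetic wraps around instead of going negative, so size computations and countdown loops misbehave without any error.

{code}
// Why unsigned types can silently introduce bugs: wraparound instead of -1.
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  std::vector<int> v;  // deliberately empty

  // Bug: v.size() is size_t (unsigned), so v.size() - 1 wraps to a huge value.
  size_t wrapped = v.size() - 1;
  std::printf("v.size() - 1 == %zu\n", wrapped);

  // The countdown form of the same bug: "i >= 0" is always true for size_t,
  // so the loop would underflow past zero (left commented out on purpose).
  // for (size_t i = v.size() - 1; i >= 0; --i) { ... }

  // A signed index makes the termination check meaningful.
  for (int64_t i = static_cast<int64_t>(v.size()) - 1; i >= 0; --i) {
    std::printf("element %d\n", static_cast<int>(i));
  }
  return 0;
}
{code}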
[jira] [Updated] (PARQUET-458) Implement support for DataPageV2
[ https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated PARQUET-458: - Priority: Minor (was: Major) > Implement support for DataPageV2 > > > Key: PARQUET-458 > URL: https://issues.apache.org/jira/browse/PARQUET-458 > Project: Parquet > Issue Type: New Feature > Components: parquet-cpp >Reporter: Wes McKinney >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (PARQUET-520) Add support for zero-copy InputStreams on memory-mapped files
[ https://issues.apache.org/jira/browse/PARQUET-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned PARQUET-520: Assignee: Wes McKinney > Add support for zero-copy InputStreams on memory-mapped files > - > > Key: PARQUET-520 > URL: https://issues.apache.org/jira/browse/PARQUET-520 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Reporter: Wes McKinney >Assignee: Wes McKinney > > Noted this while working on PARQUET-497. If we are using a memory-mapped > file, then copying data into a {{ScopedInMemoryInputStream}} as we do now is > unnecessary, and avoiding that copy will yield improved performance. Perhaps this should be made > a property of the {{InputStream}} (i.e. indicate whether it supports zero-copy > reads, meaning the returned buffer does not become invalid after future reads as > long as the stream -- the memory map specifically in this example -- is alive). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
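To make the idea concrete, here is a rough C++ sketch of an input stream with a queryable zero-copy property, where Read can hand back a pointer into the underlying mapping instead of copying into scratch space. All names and signatures are illustrative assumptions, not the eventual parquet-cpp API.

{code}
// Sketch: an input stream whose Read may return a pointer straight into an
// already-mapped region (zero copy), with a property callers can query.
#include <algorithm>
#include <cstdint>
#include <vector>

class InputStreamSketch {
 public:
  virtual ~InputStreamSketch() = default;
  // If true, buffers returned by Read remain valid for the stream's lifetime.
  virtual bool SupportsZeroCopy() const = 0;
  virtual const uint8_t* Read(int64_t nbytes, int64_t* bytes_read) = 0;
};

// Backed by a memory mapping (simulated here with a vector): Read returns a
// pointer into the mapping, so nothing is copied into scratch buffers.
class MappedInputStreamSketch : public InputStreamSketch {
 public:
  MappedInputStreamSketch(const uint8_t* mapping, int64_t size)
      : mapping_(mapping), size_(size) {}
  bool SupportsZeroCopy() const override { return true; }
  const uint8_t* Read(int64_t nbytes, int64_t* bytes_read) override {
    *bytes_read = std::min(nbytes, size_ - pos_);
    const uint8_t* out = mapping_ + pos_;
    pos_ += *bytes_read;
    return out;
  }

 private:
  const uint8_t* mapping_;
  int64_t size_;
  int64_t pos_ = 0;
};

int main() {
  std::vector<uint8_t> fake_mapping(1024, 0xAB);
  MappedInputStreamSketch stream(fake_mapping.data(),
                                 static_cast<int64_t>(fake_mapping.size()));
  int64_t n = 0;
  const uint8_t* buf = stream.Read(128, &n);
  // A zero-copy-aware caller may hold on to `buf` instead of copying it, as
  // long as the stream (here, the mapping it wraps) stays alive.
  return (buf != nullptr && n == 128) ? 0 : 1;
}
{code}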
[jira] [Commented] (PARQUET-481) Refactor and expand reader-test
[ https://issues.apache.org/jira/browse/PARQUET-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148947#comment-15148947 ] Deepak Majeti commented on PARQUET-481: --- I am adding a few tests to check NULLs for the Scanner as part of PARQUET-532, making sure we don't duplicate work. > Refactor and expand reader-test > --- > > Key: PARQUET-481 > URL: https://issues.apache.org/jira/browse/PARQUET-481 > Project: Parquet > Issue Type: Improvement > Components: parquet-cpp >Affects Versions: cpp-0.1 >Reporter: Aliaksei Sandryhaila >Assignee: Aliaksei Sandryhaila > Fix For: cpp-0.1 > > > reader-test currently tests with a parquet file and only verifies that we can > read it, not the correctness of the output. > Proposed changes: > - Remove the use of parquet files and use mock objects instead. > - Move tests for Scanner to scanner-test.cc > - Get rid of DebugPrint() tests, move to ParquetFilePrinter as a part of > PARQUET-508. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
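To illustrate the direction of the refactor, below is a deliberately simplified sketch of a mock-driven test: pages are built in memory rather than read from a .parquet file, and the assertion checks the decoded values instead of merely confirming that a read succeeded. Every type in it (FakePage, FakePageSource, ReadAll) is a hypothetical stand-in, not the actual parquet-cpp test API.

{code}
// Sketch of a mock-based reader test: feed in-memory "pages" and verify output.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct FakePage {
  std::vector<int32_t> values;  // already-decoded values, for simplicity
};

class FakePageSource {
 public:
  explicit FakePageSource(std::vector<FakePage> pages) : pages_(std::move(pages)) {}
  const FakePage* NextPage() {
    if (index_ == pages_.size()) return nullptr;
    return &pages_[index_++];
  }
 private:
  std::vector<FakePage> pages_;
  size_t index_ = 0;
};

// Stand-in for the reader under test: drains every page from the source.
std::vector<int32_t> ReadAll(FakePageSource* source) {
  std::vector<int32_t> out;
  while (const FakePage* page = source->NextPage()) {
    out.insert(out.end(), page->values.begin(), page->values.end());
  }
  return out;
}

int main() {
  // Two pages, so the test also exercises the page boundary (cf. PARQUET-523/531).
  FakePageSource source({FakePage{{1, 2, 3}}, FakePage{{4, 5}}});
  std::vector<int32_t> result = ReadAll(&source);
  assert((result == std::vector<int32_t>{1, 2, 3, 4, 5}));  // check correctness, not just "it read"
  return 0;
}
{code}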
[jira] [Created] (PARQUET-535) Make writeAllFields more efficient in proto-parquet component
Chen Song created PARQUET-535: - Summary: Make writeAllFields more efficient in proto-parquet component Key: PARQUET-535 URL: https://issues.apache.org/jira/browse/PARQUET-535 Project: Parquet Issue Type: Bug Components: parquet-mr Reporter: Chen Song Assignee: Chen Song Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
HashJoin throws ParquetDecodingException with input as ParquetTupleScheme
Hi, I am facing a problem while using HashJoin with input read through ParquetTupleScheme. I have two source taps: one uses the TextDelimited scheme and the other uses ParquetTupleScheme. I am performing a HashJoin and writing the data as a delimited file. The program runs successfully in local mode, but when I try to run it on the cluster it gives the following error:

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://Hostname:8020/user/username/testData/lookup-file.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:211)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
    at parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:91)
    at parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:42)
    at cascading.tap.hadoop.io.MultiRecordReaderIterator.makeReader(MultiRecordReaderIterator.java:123)
    at cascading.tap.hadoop.io.MultiRecordReaderIterator.getNextReader(MultiRecordReaderIterator.java:172)
    at cascading.tap.hadoop.io.MultiRecordReaderIterator.hasNext(MultiRecordReaderIterator.java:133)
    at cascading.tuple.TupleEntrySchemeIterator.<init>(TupleEntrySchemeIterator.java:94)
    at cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:49)
    at cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:44)
    at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:439)
    at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:108)
    at cascading.flow.stream.element.SourceStage.map(SourceStage.java:82)
    at cascading.flow.stream.element.SourceStage.run(SourceStage.java:66)
    at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
    at parquet.hadoop.util.counters.mapred.MapRedCounterAdapter.increment(MapRedCounterAdapter.java:34)
    at parquet.hadoop.util.counters.BenchmarkCounter.incrementTotalBytes(BenchmarkCounter.java:75)
    at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:114)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:191)
    ... 21 more

Below is the use case:

public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    String argsString = "";
    for (String arg : otherArgs) {
        argsString = argsString + " " + arg;
    }
    System.out.println("After processing arguments are:" + argsString);

    Properties properties = new Properties();
    properties.putAll(conf.getValByRegex(".*"));
    String OutputPath = "testData/BasicEx_Output";

    // Left-hand side: pipe-delimited text file.
    Class types1[] = { String.class, String.class, String.class };
    Fields f1 = new Fields("id1", "city1", "state");
    Tap source = new Hfs(new TextDelimited(f1, "|", "", types1, false), "main-txt-file.dat");
    Pipe pipe = new Pipe("ReadWrite");

    // Right-hand (lookup) side: Parquet file read through ParquetTupleScheme.
    Scheme pScheme = new ParquetTupleScheme();
    Tap source2 = new Hfs(pScheme, "testData/lookup-file.parquet");
    Pipe pipe2 = new Pipe("ReadWrite2");

    // Join the two sides and write the result as delimited text.
    Pipe tokenPipe = new HashJoin(pipe, new Fields("id1"), pipe2, new Fields("id"), new LeftJoin());
    Tap sink = new Hfs(new TextDelimited(f1, true, "|"), OutputPath, SinkMode.REPLACE);

    FlowDef flowDef1 = FlowDef.flowDef().addSource(pipe, source).addSource(pipe2, source2).addTailSink(tokenPipe, sink);
    new Hadoop2MR1FlowConnector(properties).connect(flowDef1).complete();
}

I have attached the input files for reference. Please help me in solving this issue. I have asked the same question on the Cascading Google group, and below is the response:

André Kelpe: This looks like a bug caused by a wrong assumption in parquet. I fixed a similar thing 2 years ago in parquet: https://github.com/Parquet/parquet-mr/pull/388/ Can you check with the upstream project? It looks like it is their problem and not a problem in Cascading. - André
[jira] [Resolved] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging
[ https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michal Turek resolved PARQUET-408. -- Resolution: Won't Fix Fix Version/s: (was: 1.9.0) I'm closing the issue; the workaround is known and can be used in applications, and PARQUET-401 will introduce a final fix. > Shutdown hook in parquet-avro library corrupts data and disables logging > > > Key: PARQUET-408 > URL: https://issues.apache.org/jira/browse/PARQUET-408 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.8.1 >Reporter: Michal Turek >Assignee: Michal Turek > Attachments: parquet-broken-shutdown_2015-12-16.tar.gz > > > Parquet-avro and probably also other Parquet libraries are not well behaved. > It registers a shutdown hook that bypasses application shutdown sequence, > corrupts data written to currently opened Parquet file(s) and disables or > reconfigures slf4j/logback logger so no further log message is visible. > h3. Scope > Our application is a microservice that handles stop request in form of signal > SIGTERM, resp. JVM shutdown hook. If it arrives the application will close > all opened files (writers), release all other resources and gracefully > shutdown. We are swiching from sequence files to Parquet at the moment and > using Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}} which is > current latest version. We are using > {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM. > h3. Example code > See archive in attachment. > - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your > Hadoop. > - Use {{mvn package}} to compile. > - Copy Hadoop configuration XMLs to {{config}} directory. > - Update configuration at the top of {{ParquetBrokenShutdown}} class. > - Execute {{ParquetBrokenShutdown}} class. > - Send SIGTERM to shutdown the application ({{kill PID}}). > h3. Initial analysis > Parquet library tries to care about application shutdown but this introduces > more issues than solves. If application is writing to a file and the library > asynchronously decides to close underlying writer, data loss will occur. The > handle is just closed and all remaining records can't be written.
> {noformat} > Writing to HDFS/Parquet failed > java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, > uncompressed_page_size:14, compressed_page_size:34, > dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN)) > at org.apache.parquet.format.Util.write(Util.java:224) > at org.apache.parquet.format.Util.writePageHeader(Util.java:61) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760) > at > org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) > at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) > Caused by: parquet.org.apache.thrift.transport.TTransportException: > java.nio.channels.ClosedChannelException > at > parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) > at > parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431) > at >
[jira] [Commented] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging
[ https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148540#comment-15148540 ] Michal Turek commented on PARQUET-408: -- Ok, I was debugging for a while and finally found what is happening. Logging is not disabled on shutdown at all, but the shutdown hook of {{java.util.logging}} corrupts the stdout stream. Messages that go to {{System.out}} using {{ConsoleAppender}} are not written because the stream is closed. On the other hand, messages that go to a file using {{RollingFileAppender}} are correctly there. Below is a detailed analysis if anyone is interested. Note there is a workaround at the very end.

Effective logback.xml:
{noformat}
%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %-35logger{35} [%thread]: %msg \(%file:%line\)%n%xThrowable{full}
test.log
test.log.%d{yyyy-MM-dd}_%i.gz
30
250MB
%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %-45logger{45} [%thread]: %msg%n%xThrowable{full}
{noformat}

Output on console or in JIdea IDE:
{noformat}
Logging test: starting application
Logging test: application started
2016-02-16 12:39:58.922 INFO c.a.b.parquet.ParquetBrokenShutdown [main]: === STARTING APPLICATION === (ParquetBrokenShutdown.java:177)
2016-02-16 12:39:58.928 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: starting application (ParquetBrokenShutdown.java:178)
2016-02-16 12:39:58.928 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: starting application (ParquetBrokenShutdown.java:179)
2016-02-16 12:39:58.928 INFO c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: starting application (ParquetBrokenShutdown.java:180)
2016-02-16 12:39:58.928 WARN c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: starting application (ParquetBrokenShutdown.java:181)
2016-02-16 12:39:58.928 ERROR c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: starting application (ParquetBrokenShutdown.java:182)
2016-02-16 12:39:58.929 INFO c.a.b.parquet.ParquetBrokenShutdown [main]: === APPLICATION STARTED === (ParquetBrokenShutdown.java:177)
2016-02-16 12:39:58.930 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: application started (ParquetBrokenShutdown.java:178)
2016-02-16 12:39:58.930 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: application started (ParquetBrokenShutdown.java:179)
2016-02-16 12:39:58.930 INFO c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: application started (ParquetBrokenShutdown.java:180)
2016-02-16 12:39:58.930 WARN c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: application started (ParquetBrokenShutdown.java:181)
2016-02-16 12:39:58.931 ERROR c.a.b.parquet.ParquetBrokenShutdown [main]: Logging test: application started (ParquetBrokenShutdown.java:182)
2016-02-16 12:39:59.188 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: Opening Parquet file: hdfs://nameservice1/user/turek/bugreport.parquet (ParquetBrokenShutdown.java:95)
2016-02-16 12:39:59.335 WARN o.a.hadoop.util.NativeCodeLoader[main]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (NativeCodeLoader.java:62)
2016-02-16 12:40:00.228 INFO o.a.hadoop.io.compress.CodecPool[main]: Got brand-new compressor [.gz] (CodecPool.java:151)
2016-02-16 12:40:00.295 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: Parquet file opened: hdfs://nameservice1/user/turek/bugreport.parquet (ParquetBrokenShutdown.java:105)
2016-02-16 12:40:00.297 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:01.299 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:02.299 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:03.301 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:03.585 INFO c.a.b.parquet.ParquetBrokenShutdown [Thread-0]: === STOPPING APPLICATION === (ParquetBrokenShutdown.java:177)
Logging test: stopping application
Logging test: application finished
Logging test: shutdown hook finished
Process finished with exit code 143
{noformat}

Output in {{test.log}} file:
{noformat}
2016-02-16 12:39:58.922 INFO c.a.bugreport.parquet.ParquetBrokenShutdown [main]: === STARTING APPLICATION ===
2016-02-16 12:39:58.928 TRACE c.a.bugreport.parquet.ParquetBrokenShutdown [main]: Logging test: starting application
[jira] [Assigned] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging
[ https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michal Turek reassigned PARQUET-408: Assignee: Michal Turek > Shutdown hook in parquet-avro library corrupts data and disables logging > > > Key: PARQUET-408 > URL: https://issues.apache.org/jira/browse/PARQUET-408 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.8.1 >Reporter: Michal Turek >Assignee: Michal Turek > Fix For: 1.9.0 > > Attachments: parquet-broken-shutdown_2015-12-16.tar.gz > > > Parquet-avro and probably also other Parquet libraries are not well behaved. > It registers a shutdown hook that bypasses application shutdown sequence, > corrupts data written to currently opened Parquet file(s) and disables or > reconfigures slf4j/logback logger so no further log message is visible. > h3. Scope > Our application is a microservice that handles stop request in form of signal > SIGTERM, resp. JVM shutdown hook. If it arrives the application will close > all opened files (writers), release all other resources and gracefully > shutdown. We are swiching from sequence files to Parquet at the moment and > using Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}} which is > current latest version. We are using > {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM. > h3. Example code > See archive in attachment. > - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your > Hadoop. > - Use {{mvn package}} to compile. > - Copy Hadoop configuration XMLs to {{config}} directory. > - Update configuration at the top of {{ParquetBrokenShutdown}} class. > - Execute {{ParquetBrokenShutdown}} class. > - Send SIGTERM to shutdown the application ({{kill PID}}). > h3. Initial analysis > Parquet library tries to care about application shutdown but this introduces > more issues than solves. If application is writing to a file and the library > asynchronously decides to close underlying writer, data loss will occur. The > handle is just closed and all remaining records can't be written. 
> {noformat} > Writing to HDFS/Parquet failed > java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, > uncompressed_page_size:14, compressed_page_size:34, > dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN)) > at org.apache.parquet.format.Util.write(Util.java:224) > at org.apache.parquet.format.Util.writePageHeader(Util.java:61) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760) > at > org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) > at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) > Caused by: parquet.org.apache.thrift.transport.TTransportException: > java.nio.channels.ClosedChannelException > at > parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) > at > parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:194) > at >
[jira] [Comment Edited] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging
[ https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148323#comment-15148323 ] Michal Turek edited comment on PARQUET-408 at 2/16/16 10:01 AM: I have just tried to add `log4j.xml` and `log4j2.xml` on classpath but it didn't help as expected. http://coders-kitchen.com/2014/01/29/tip-disable-log4j-2s-shutdown-handler/ {noformat} {noformat} Adding {{import org.apache.logging.log4j.Logger;}} introduces a compiler error so I'm pretty sure, log4j2 is not the cause. Can you update the attached sample code and make it functional to demonstrate your fix, please? was (Author: tu...@avast.com): I have just tried to add `log4j.xml` and `log4j2.xml` on classpath but it didn't help as expected. > Shutdown hook in parquet-avro library corrupts data and disables logging > > > Key: PARQUET-408 > URL: https://issues.apache.org/jira/browse/PARQUET-408 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.8.1 >Reporter: Michal Turek > Fix For: 1.9.0 > > Attachments: parquet-broken-shutdown_2015-12-16.tar.gz > > > Parquet-avro and probably also other Parquet libraries are not well behaved. > It registers a shutdown hook that bypasses application shutdown sequence, > corrupts data written to currently opened Parquet file(s) and disables or > reconfigures slf4j/logback logger so no further log message is visible. > h3. Scope > Our application is a microservice that handles stop request in form of signal > SIGTERM, resp. JVM shutdown hook. If it arrives the application will close > all opened files (writers), release all other resources and gracefully > shutdown. We are swiching from sequence files to Parquet at the moment and > using Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}} which is > current latest version. We are using > {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM. > h3. Example code > See archive in attachment. > - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your > Hadoop. > - Use {{mvn package}} to compile. > - Copy Hadoop configuration XMLs to {{config}} directory. > - Update configuration at the top of {{ParquetBrokenShutdown}} class. > - Execute {{ParquetBrokenShutdown}} class. > - Send SIGTERM to shutdown the application ({{kill PID}}). > h3. Initial analysis > Parquet library tries to care about application shutdown but this introduces > more issues than solves. If application is writing to a file and the library > asynchronously decides to close underlying writer, data loss will occur. The > handle is just closed and all remaining records can't be written. 
> {noformat} > Writing to HDFS/Parquet failed > java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, > uncompressed_page_size:14, compressed_page_size:34, > dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN)) > at org.apache.parquet.format.Util.write(Util.java:224) > at org.apache.parquet.format.Util.writePageHeader(Util.java:61) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760) > at > org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) > at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) > Caused by: parquet.org.apache.thrift.transport.TTransportException: > java.nio.channels.ClosedChannelException > at >
[jira] [Commented] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging
[ https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148323#comment-15148323 ] Michal Turek commented on PARQUET-408: -- I have just tried to add `log4j.xml` and `log4j2.xml` on classpath but it didn't help as expected. > Shutdown hook in parquet-avro library corrupts data and disables logging > > > Key: PARQUET-408 > URL: https://issues.apache.org/jira/browse/PARQUET-408 > Project: Parquet > Issue Type: Bug > Components: parquet-avro >Affects Versions: 1.8.1 >Reporter: Michal Turek > Fix For: 1.9.0 > > Attachments: parquet-broken-shutdown_2015-12-16.tar.gz > > > Parquet-avro and probably also other Parquet libraries are not well behaved. > It registers a shutdown hook that bypasses application shutdown sequence, > corrupts data written to currently opened Parquet file(s) and disables or > reconfigures slf4j/logback logger so no further log message is visible. > h3. Scope > Our application is a microservice that handles stop request in form of signal > SIGTERM, resp. JVM shutdown hook. If it arrives the application will close > all opened files (writers), release all other resources and gracefully > shutdown. We are swiching from sequence files to Parquet at the moment and > using Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}} which is > current latest version. We are using > {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM. > h3. Example code > See archive in attachment. > - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your > Hadoop. > - Use {{mvn package}} to compile. > - Copy Hadoop configuration XMLs to {{config}} directory. > - Update configuration at the top of {{ParquetBrokenShutdown}} class. > - Execute {{ParquetBrokenShutdown}} class. > - Send SIGTERM to shutdown the application ({{kill PID}}). > h3. Initial analysis > Parquet library tries to care about application shutdown but this introduces > more issues than solves. If application is writing to a file and the library > asynchronously decides to close underlying writer, data loss will occur. The > handle is just closed and all remaining records can't be written. 
> {noformat} > Writing to HDFS/Parquet failed > java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, > uncompressed_page_size:14, compressed_page_size:34, > dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN)) > at org.apache.parquet.format.Util.write(Util.java:224) > at org.apache.parquet.format.Util.writePageHeader(Util.java:61) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760) > at > org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179) > at > org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) > at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53) > at > com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144) > Caused by: parquet.org.apache.thrift.transport.TTransportException: > java.nio.channels.ClosedChannelException > at > parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147) > at > parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431) > at > parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:194) > at >