[jira] [Commented] (PARQUET-531) Can't read past first page in a column

2016-02-16 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149593#comment-15149593
 ] 

Deepak Majeti commented on PARQUET-531:
---

[~wesmckinn] I will look into this once the PARQUET-499 patch is completed. 
Looks like some variables are not being re-initialized in the scanner.
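
As a rough illustration of that hypothesis, here is a minimal sketch (hypothetical names, not the actual parquet-cpp classes) of the per-batch state a scanner has to reset when it refills its buffer; failing to reset the read cursor after the first batch of DEFAULT_SCANNER_BATCH_SIZE values would produce exactly a failure at row 128:

{code}
#include <cstdint>
#include <vector>

// Illustrative only: hypothetical scanner-like buffering, not the parquet-cpp API.
class ScannerSketch {
 public:
  explicit ScannerSketch(int batch_size) : batch_size_(batch_size) {
    values_.resize(batch_size);
  }

  // Returns true while values remain; refills the batch buffer when exhausted.
  bool EnsureValuesBuffered() {
    if (value_offset_ < values_buffered_) {
      return true;  // still unread values in the current batch
    }
    // Refill from the underlying reader and reset the per-batch cursor.
    values_buffered_ = ReadNextBatch(batch_size_, values_.data());
    value_offset_ = 0;  // the kind of re-initialization that appears to be missing
    return values_buffered_ > 0;
  }

 private:
  // Stand-in for the column reader; a real implementation would decode pages.
  int ReadNextBatch(int max_values, int32_t* out) {
    (void)out;
    return max_values;  // pretend a full batch was decoded
  }

  int batch_size_;
  int values_buffered_ = 0;
  int value_offset_ = 0;
  std::vector<int32_t> values_;
};
{code}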

> Can't read past first page in a column
> --
>
> Key: PARQUET-531
> URL: https://issues.apache.org/jira/browse/PARQUET-531
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>Reporter: Spiro Michaylov
>Assignee: Deepak Majeti
> Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>    case parquet::CompressionCodec::GZIP:
>      decompressor_.reset(new GZipCodec());
>      break;
> {code}
> I try to run the parquet_reader example on the column I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with  
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it, but actually things 
> seem to decompress just fine (congratulations: this is awesome!). Looking at 
> the problem in the debugger and tracing through a bit, it seems to me like the 
> buffering is a bit screwed up in general -- some kind of confusion between 
> the buffering at the Scanner and Reader levels. I can reproduce the problem 
> by reading through just a single column too. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
> DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-531) Can't read past first page in a column

2016-02-16 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149557#comment-15149557
 ] 

Wes McKinney commented on PARQUET-531:
--

[~mdeepak] I just tried reading this file after PARQUET-515 and PARQUET-523 
were applied, and it appears the bug lies in the Scanner, so we can leave this 
open until we have a test-case reproduction.

> Can't read past first page in a column
> --
>
> Key: PARQUET-531
> URL: https://issues.apache.org/jira/browse/PARQUET-531
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
> Environment: Ubuntu Linux 14.04 (no obvious platform dependence), 
> Parquet file created by Apache Spark 1.5.0 on the same platform. 
>Reporter: Spiro Michaylov
> Attachments: 
> part-r-00031-e5d9a4ef-d73e-406c-8c2f-9ad1f20ebf8e.gz.parquet
>
>
> Building the code as of 2/14/2016 and adding the obvious three lines of code 
> to serialized-page.cc to enable the newly added CompressionCodec::GZIP:
> {code}
>    case parquet::CompressionCodec::GZIP:
>      decompressor_.reset(new GZipCodec());
>      break;
> {code}
> I try to run the parquet_reader example on the column I'm about to attach, 
> which was created by Apache Spark 1.5.0. It works surprisingly well until it 
> hits the end of the first page, where it dies with  
> {quote}
> Parquet error: Value was non-null, but has not been buffered
> {quote}
> I realize you may be reluctant to look at this because (a) the GZip support 
> is new and (b) I had to modify the code to enable it, but actually things 
> seem to decompress just fine (congratulations: this is awesome!). Looking at 
> the problem in the debugger and tracing through a bit, it seems to me like the 
> buffering is a bit screwed up in general -- some kind of confusion between 
> the buffering at the Scanner and Reader levels. I can reproduce the problem 
> by reading through just a single column too. 
> It fails after 128 rows, which is suspicious given this line in 
> column/scanner.h:
> {code}
> DEFAULT_SCANNER_BATCH_SIZE = 128;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-514) Automate coveralls.io updates in Travis CI

2016-02-16 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-514:
-
Description: The repo has been enabled in INFRA-11273, so all that's left 
is to work on the Travis CI build matrix and add coveralls to one of the builds 
(rather than running it for all of them)  (was: This is actually fairly 
seamless -- if you run the codecov upload from within the Travis CI job, you do 
not need the codecov API token. There may be some twiddling to do to set up a 
build matrix that includes only a single job uploading code coverage (right now 
the build matrix is somewhat monolithic and hard to customize in a granular 
way).)
Summary: Automate coveralls.io updates in Travis CI  (was: Automate 
codecov.io updates in Travis CI)

> Automate coveralls.io updates in Travis CI
> --
>
> Key: PARQUET-514
> URL: https://issues.apache.org/jira/browse/PARQUET-514
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> The repo has been enabled in INFRA-11273, so all that's left is to work on 
> the Travis CI build matrix and add coveralls to one of the builds (rather 
> than running it for all of them)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-523) Test ColumnReader on a column chunk containing multiple data pages

2016-02-16 Thread Deepak Majeti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Deepak Majeti resolved PARQUET-523.
---
Resolution: Resolved

Issue resolved by pull request 51
https://github.com/apache/parquet-cpp/pull/51

> Test ColumnReader on a column chunk containing multiple data pages  
> 
>
> Key: PARQUET-523
> URL: https://issues.apache.org/jira/browse/PARQUET-523
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Deepak Majeti
>
> Our test cases currently only cover data containing a single data page



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-515) Add "Reset" to LevelEncoder and LevelDecoder

2016-02-16 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-515.
---
   Resolution: Fixed
Fix Version/s: cpp-0.1

Issue resolved by pull request 51
[https://github.com/apache/parquet-cpp/pull/51]

> Add "Reset" to LevelEncoder and LevelDecoder
> 
>
> Key: PARQUET-515
> URL: https://issues.apache.org/jira/browse/PARQUET-515
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Deepak Majeti
>Assignee: Deepak Majeti
> Fix For: cpp-0.1
>
>
> The rle-encoder and rle-decoder classes have a "Reset" method as a quick way 
> to initialize the objects. This method resets the encoder and decoder state to 
> work on a new buffer without the need to create a new object at every DATA 
> PAGE granularity.
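
For illustration, a Reset of this kind can be as simple as re-pointing the decoder at the new page's level data and clearing the per-page counters; the sketch below uses hypothetical names rather than the actual parquet-cpp signatures:

{code}
#include <cstdint>

// Illustrative only: hypothetical shape of a reusable level decoder.
class LevelDecoderSketch {
 public:
  // Re-initialize the decoder for the next data page without allocating a new
  // object: point at the new buffer and clear the per-page state.
  void Reset(const uint8_t* data, int data_size, int num_values) {
    data_ = data;
    data_size_ = data_size;
    num_values_ = num_values;
    values_decoded_ = 0;  // forget progress made on the previous page
  }

  int values_remaining() const { return num_values_ - values_decoded_; }

 private:
  const uint8_t* data_ = nullptr;
  int data_size_ = 0;
  int num_values_ = 0;
  int values_decoded_ = 0;
};
{code}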



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-514) Automate codecov.io updates in Travis CI

2016-02-16 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149169#comment-15149169
 ] 

Wes McKinney commented on PARQUET-514:
--

I've asked ASF infra to help with this, see INFRA-11273

> Automate codecov.io updates in Travis CI
> 
>
> Key: PARQUET-514
> URL: https://issues.apache.org/jira/browse/PARQUET-514
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>
> This is actually fairly seamless -- if you run the codecov upload from within 
> the Travis CI job, you do not need the codecov API token. There may be some 
> twiddling to do to set up a build matrix that includes only a single job 
> uploading code coverage (right now the build matrix is somewhat monolithic 
> and hard to customize in a granular way).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-518) Review usages of size_t and unsigned integers generally per Google style guide

2016-02-16 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-518:


Assignee: Wes McKinney

> Review usages of size_t and unsigned integers generally per Google style guide
> --
>
> Key: PARQUET-518
> URL: https://issues.apache.org/jira/browse/PARQUET-518
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Minor
>
> The Google style guide recommends generally avoiding unsigned integers for 
> the bugs they can silently introduce. 
> https://google.github.io/styleguide/cppguide.html#Integer_Types
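
The canonical example of the silent bug the style guide refers to is a reverse loop over an unsigned index: `i >= 0` is always true, and decrementing past zero wraps around instead of going negative. A minimal illustration (not from the parquet-cpp code base):

{code}
#include <cstdint>
#include <vector>

// Never terminates: when i reaches 0, --i wraps around to SIZE_MAX, and the
// condition i >= 0 is trivially true for an unsigned type.
int SumReversedBuggy(const std::vector<int>& v) {
  int sum = 0;
  for (size_t i = v.size() - 1; i >= 0; --i) {  // also wraps when v is empty
    sum += v[i];  // reads out of bounds after the wrap-around
  }
  return sum;
}

// Using a signed index behaves as intended.
int SumReversedFixed(const std::vector<int>& v) {
  int sum = 0;
  for (int64_t i = static_cast<int64_t>(v.size()) - 1; i >= 0; --i) {
    sum += v[static_cast<size_t>(i)];
  }
  return sum;
}
{code}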



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-458) Implement support for DataPageV2

2016-02-16 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-458:
-
Priority: Minor  (was: Major)

> Implement support for DataPageV2
> 
>
> Key: PARQUET-458
> URL: https://issues.apache.org/jira/browse/PARQUET-458
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PARQUET-520) Add support for zero-copy InputStreams on memory-mapped files

2016-02-16 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-520:


Assignee: Wes McKinney

> Add support for zero-copy InputStreams on memory-mapped files
> -
>
> Key: PARQUET-520
> URL: https://issues.apache.org/jira/browse/PARQUET-520
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>
> Noted this while working on PARQUET-497. If we are using a memory-mapped 
> file, then copying data into a {{ScopedInMemoryInputStream}} as we are now is 
> unnecessary, and avoiding the copy will yield improved performance. Perhaps this 
> should be made a property of the {{InputStream}} (i.e. indicate whether it 
> supports zero-copy reads, and that the returned buffer does not become invalid 
> after future reads as long as the stream -- the memory map specifically in this 
> example -- is alive).
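
A minimal sketch of what such a property might look like, assuming a hypothetical stream interface (these are not the existing parquet-cpp classes): the stream advertises zero-copy support, and `Read` then returns a pointer directly into the memory map instead of copying into a scratch buffer:

{code}
#include <algorithm>
#include <cstdint>

// Illustrative only: hypothetical interface with an explicit zero-copy property.
class RandomInputStream {
 public:
  virtual ~RandomInputStream() = default;
  // Returns a pointer to up to num_bytes of data. When SupportsZeroCopy() is
  // true the pointer aliases the underlying buffer (e.g. an mmap'd file) and
  // stays valid for the lifetime of the stream; otherwise it may point into a
  // scratch buffer that the next Read overwrites.
  virtual const uint8_t* Read(int64_t num_bytes, int64_t* bytes_read) = 0;
  virtual bool SupportsZeroCopy() const = 0;
};

class MemoryMapStream : public RandomInputStream {
 public:
  MemoryMapStream(const uint8_t* mapped, int64_t size)
      : mapped_(mapped), size_(size) {}

  const uint8_t* Read(int64_t num_bytes, int64_t* bytes_read) override {
    *bytes_read = std::min(num_bytes, size_ - offset_);
    const uint8_t* out = mapped_ + offset_;  // no copy: pointer into the map
    offset_ += *bytes_read;
    return out;
  }

  bool SupportsZeroCopy() const override { return true; }

 private:
  const uint8_t* mapped_;
  int64_t size_;
  int64_t offset_ = 0;
};
{code}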



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PARQUET-481) Refactor and expand reader-test

2016-02-16 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148947#comment-15148947
 ] 

Deepak Majeti commented on PARQUET-481:
---

I am adding a few tests to check NULLs for the Scanner as part of PARQUET-532, 
just making sure we don't duplicate work.

> Refactor and expand reader-test
> ---
>
> Key: PARQUET-481
> URL: https://issues.apache.org/jira/browse/PARQUET-481
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Affects Versions: cpp-0.1
>Reporter: Aliaksei Sandryhaila
>Assignee: Aliaksei Sandryhaila
> Fix For: cpp-0.1
>
>
> reader-test currently tests with a parquet file and only verifies that we can 
> read it, not the correctness of the output.
> Proposed changes:
> - Remove the use of parquet files and use mock objects instead.
> - Move tests for Scanner to scanner-test.cc
> - Get rid of DebugPrint() tests, move to ParquetFilePrinter as a part of 
> PARQUET-508.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PARQUET-535) Make writeAllFields more efficient in proto-parquet component

2016-02-16 Thread Chen Song (JIRA)
Chen Song created PARQUET-535:
-

 Summary: Make writeAllFields more efficient in proto-parquet 
component
 Key: PARQUET-535
 URL: https://issues.apache.org/jira/browse/PARQUET-535
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Chen Song
Assignee: Chen Song
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


HashJoin throws ParquetDecodingException with input as ParquetTupleScheme

2016-02-16 Thread Santlal J Gupta
Hi,

I am facing a problem while using HashJoin with an input that uses 
ParquetTupleScheme. I have two source taps: one uses the TextDelimited scheme and 
the other uses ParquetTupleScheme. I am performing a HashJoin and writing the data 
out as a delimited file. The program runs successfully in local mode, but when I 
try to run it on a cluster it gives the following error:

parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in 
file hdfs://Hostname:8020/user/username/testData/lookup-file.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:211)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:144)
at 
parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.<init>(DeprecatedParquetInputFormat.java:91)
at 
parquet.hadoop.mapred.DeprecatedParquetInputFormat.getRecordReader(DeprecatedParquetInputFormat.java:42)
at 
cascading.tap.hadoop.io.MultiRecordReaderIterator.makeReader(MultiRecordReaderIterator.java:123)
at 
cascading.tap.hadoop.io.MultiRecordReaderIterator.getNextReader(MultiRecordReaderIterator.java:172)
at 
cascading.tap.hadoop.io.MultiRecordReaderIterator.hasNext(MultiRecordReaderIterator.java:133)
at 
cascading.tuple.TupleEntrySchemeIterator.<init>(TupleEntrySchemeIterator.java:94)
at 
cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:49)
at 
cascading.tap.hadoop.io.HadoopTupleEntrySchemeIterator.<init>(HadoopTupleEntrySchemeIterator.java:44)
at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:439)
at cascading.tap.hadoop.Hfs.openForRead(Hfs.java:108)
at cascading.flow.stream.element.SourceStage.map(SourceStage.java:82)
at cascading.flow.stream.element.SourceStage.run(SourceStage.java:66)
at cascading.flow.hadoop.FlowMapper.run(FlowMapper.java:139)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.NullPointerException
at 
parquet.hadoop.util.counters.mapred.MapRedCounterAdapter.increment(MapRedCounterAdapter.java:34)
at 
parquet.hadoop.util.counters.BenchmarkCounter.incrementTotalBytes(BenchmarkCounter.java:75)
at 
parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:349)
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:114)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:191)
... 21 more

Below is the use case:

// Imports below assume Cascading 3.x (Hadoop2MR1FlowConnector) and the
// parquet-cascading module; adjust package names to your versions.
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

import cascading.flow.FlowDef;
import cascading.flow.hadoop2.Hadoop2MR1FlowConnector;
import cascading.pipe.HashJoin;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.LeftJoin;
import cascading.scheme.Scheme;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

import parquet.cascading.ParquetTupleScheme;

public class HashJoinParquetExample {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        String argsString = "";
        for (String arg : otherArgs) {
            argsString = argsString + " " + arg;
        }
        System.out.println("After processing arguments are:" + argsString);

        Properties properties = new Properties();
        properties.putAll(conf.getValByRegex(".*"));

        String outputPath = "testData/BasicEx_Output";
        Class[] types1 = { String.class, String.class, String.class };
        Fields f1 = new Fields("id1", "city1", "state");

        // Left side: pipe-delimited text file.
        Tap source = new Hfs(new TextDelimited(f1, "|", "", types1, false),
                "main-txt-file.dat");
        Pipe pipe = new Pipe("ReadWrite");

        // Right side: Parquet file read through ParquetTupleScheme.
        Scheme pScheme = new ParquetTupleScheme();
        Tap source2 = new Hfs(pScheme, "testData/lookup-file.parquet");
        Pipe pipe2 = new Pipe("ReadWrite2");

        // Left join on id1 (text side) == id (Parquet side).
        Pipe tokenPipe = new HashJoin(pipe, new Fields("id1"), pipe2,
                new Fields("id"), new LeftJoin());

        // Write the joined tuples back out as pipe-delimited text.
        Tap sink = new Hfs(new TextDelimited(f1, true, "|"), outputPath,
                SinkMode.REPLACE);

        FlowDef flowDef1 = FlowDef.flowDef()
                .addSource(pipe, source)
                .addSource(pipe2, source2)
                .addTailSink(tokenPipe, sink);
        new Hadoop2MR1FlowConnector(properties).connect(flowDef1).complete();
    }
}


I have attached the input files for reference. Please help me solve this issue.

I have asked the same question on the Cascading Google group, and below is the 
response to it:

André Kelpe




This looks like a bug caused by a wrong assumption in parquet. I fixed
a similar thing 2 years ago in parquet:
https://github.com/Parquet/parquet-mr/pull/388/ Can you check with the
upstream project? It looks like it is their problem and not a problem
in Cascading.

- André

[jira] [Resolved] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging

2016-02-16 Thread Michal Turek (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michal Turek resolved PARQUET-408.
--
   Resolution: Won't Fix
Fix Version/s: (was: 1.9.0)

I'm closing the issue; a workaround is known and can be used in applications, and 
PARQUET-401 will introduce a final fix.

> Shutdown hook in parquet-avro library corrupts data and disables logging
> 
>
> Key: PARQUET-408
> URL: https://issues.apache.org/jira/browse/PARQUET-408
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1
>Reporter: Michal Turek
>Assignee: Michal Turek
> Attachments: parquet-broken-shutdown_2015-12-16.tar.gz
>
>
> Parquet-avro, and probably other Parquet libraries as well, is not well behaved. 
> It registers a shutdown hook that bypasses the application's shutdown sequence, 
> corrupts data written to currently opened Parquet file(s), and disables or 
> reconfigures the slf4j/logback logger so that no further log messages are visible.
> h3. Scope
> Our application is a microservice that handles a stop request in the form of the 
> SIGTERM signal, resp. a JVM shutdown hook. When it arrives, the application closes 
> all opened files (writers), releases all other resources and shuts down 
> gracefully. We are switching from sequence files to Parquet at the moment and 
> using the Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}}, which is the 
> current latest version. We are using 
> {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM.
> h3. Example code
> See archive in attachment.
> - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your 
> Hadoop.
> - Use {{mvn package}} to compile.
> - Copy Hadoop configuration XMLs to {{config}} directory.
> - Update configuration at the top of {{ParquetBrokenShutdown}} class.
> - Execute {{ParquetBrokenShutdown}} class.
> - Send SIGTERM to shutdown the application ({{kill PID}}).
> h3. Initial analysis
> The Parquet library tries to take care of application shutdown, but this 
> introduces more issues than it solves. If the application is writing to a file 
> and the library asynchronously decides to close the underlying writer, data loss 
> will occur. The handle is simply closed and the remaining records can't be 
> written.
> {noformat}
> Writing to HDFS/Parquet failed
> java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, 
> uncompressed_page_size:14, compressed_page_size:34, 
> dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN))
>   at org.apache.parquet.format.Util.write(Util.java:224)
>   at org.apache.parquet.format.Util.writePageHeader(Util.java:61)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: parquet.org.apache.thrift.transport.TTransportException: 
> java.nio.channels.ClosedChannelException
>   at 
> parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>   at 
> parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431)
>   at 
> 

[jira] [Commented] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging

2016-02-16 Thread Michal Turek (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148540#comment-15148540
 ] 

Michal Turek commented on PARQUET-408:
--

OK, I was debugging for a while and finally found what is happening. Logging is 
not disabled on shutdown at all, but the shutdown hook of {{java.util.logging}} 
corrupts the stdout stream. Messages that go to {{System.out}} through 
{{ConsoleAppender}} are not written because the stream is closed. On the other 
hand, messages that go to a file through {{RollingFileAppender}} are correctly 
written there.

Below is a detailed analysis if anyone is interested. Note there is a 
workaround at the very end.


Effective logback.xml:

{noformat}
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %-35logger{35} [%thread]: %msg \(%file:%line\)%n%xThrowable{full}</pattern>
    </encoder>
  </appender>

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>test.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>test.log.%d{yyyy-MM-dd}_%i.gz</fileNamePattern>
      <maxHistory>30</maxHistory>
      <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
        <maxFileSize>250MB</maxFileSize>
      </timeBasedFileNamingAndTriggeringPolicy>
    </rollingPolicy>
    <encoder>
      <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} %-5level %-45logger{45} [%thread]: %msg%n%xThrowable{full}</pattern>
    </encoder>
  </appender>

  <root level="TRACE">
    <appender-ref ref="CONSOLE"/>
    <appender-ref ref="FILE"/>
  </root>
</configuration>
{noformat}


Output on the console or in the IntelliJ IDEA IDE:

{noformat}
Logging test: starting application
Logging test: application started
2016-02-16 12:39:58.922 INFO  c.a.b.parquet.ParquetBrokenShutdown [main]: 
=== STARTING APPLICATION === 
(ParquetBrokenShutdown.java:177)
2016-02-16 12:39:58.928 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: starting application (ParquetBrokenShutdown.java:178)
2016-02-16 12:39:58.928 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: starting application (ParquetBrokenShutdown.java:179)
2016-02-16 12:39:58.928 INFO  c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: starting application (ParquetBrokenShutdown.java:180)
2016-02-16 12:39:58.928 WARN  c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: starting application (ParquetBrokenShutdown.java:181)
2016-02-16 12:39:58.928 ERROR c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: starting application (ParquetBrokenShutdown.java:182)
2016-02-16 12:39:58.929 INFO  c.a.b.parquet.ParquetBrokenShutdown [main]: 
=== APPLICATION STARTED === 
(ParquetBrokenShutdown.java:177)
2016-02-16 12:39:58.930 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: application started (ParquetBrokenShutdown.java:178)
2016-02-16 12:39:58.930 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: application started (ParquetBrokenShutdown.java:179)
2016-02-16 12:39:58.930 INFO  c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: application started (ParquetBrokenShutdown.java:180)
2016-02-16 12:39:58.930 WARN  c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: application started (ParquetBrokenShutdown.java:181)
2016-02-16 12:39:58.931 ERROR c.a.b.parquet.ParquetBrokenShutdown [main]: 
Logging test: application started (ParquetBrokenShutdown.java:182)
2016-02-16 12:39:59.188 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: 
Opening Parquet file: hdfs://nameservice1/user/turek/bugreport.parquet 
(ParquetBrokenShutdown.java:95)
2016-02-16 12:39:59.335 WARN  o.a.hadoop.util.NativeCodeLoader[main]: 
Unable to load native-hadoop library for your platform... using builtin-java 
classes where applicable (NativeCodeLoader.java:62)
2016-02-16 12:40:00.228 INFO  o.a.hadoop.io.compress.CodecPool[main]: Got 
brand-new compressor [.gz] (CodecPool.java:151)
2016-02-16 12:40:00.295 DEBUG c.a.b.parquet.ParquetBrokenShutdown [main]: 
Parquet file opened: hdfs://nameservice1/user/turek/bugreport.parquet 
(ParquetBrokenShutdown.java:105)
2016-02-16 12:40:00.297 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: 
Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:01.299 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: 
Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:02.299 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: 
Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:03.301 TRACE c.a.b.parquet.ParquetBrokenShutdown [main]: 
Writing record: {"test": "test value"} (ParquetBrokenShutdown.java:111)
2016-02-16 12:40:03.585 INFO  c.a.b.parquet.ParquetBrokenShutdown [Thread-0]: 
=== STOPPING APPLICATION === 
(ParquetBrokenShutdown.java:177)
Logging test: stopping application
Logging test: application finished
Logging test: shutdown hook finished

Process finished with exit code 143
{noformat}


Output in {{test.log}} file:

{noformat}
2016-02-16 12:39:58.922 INFO  c.a.bugreport.parquet.ParquetBrokenShutdown   
[main]: === STARTING APPLICATION ===
2016-02-16 12:39:58.928 TRACE c.a.bugreport.parquet.ParquetBrokenShutdown   
[main]: Logging test: starting application

[jira] [Assigned] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging

2016-02-16 Thread Michal Turek (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michal Turek reassigned PARQUET-408:


Assignee: Michal Turek

> Shutdown hook in parquet-avro library corrupts data and disables logging
> 
>
> Key: PARQUET-408
> URL: https://issues.apache.org/jira/browse/PARQUET-408
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1
>Reporter: Michal Turek
>Assignee: Michal Turek
> Fix For: 1.9.0
>
> Attachments: parquet-broken-shutdown_2015-12-16.tar.gz
>
>
> Parquet-avro, and probably other Parquet libraries as well, is not well behaved. 
> It registers a shutdown hook that bypasses the application's shutdown sequence, 
> corrupts data written to currently opened Parquet file(s), and disables or 
> reconfigures the slf4j/logback logger so that no further log messages are visible.
> h3. Scope
> Our application is a microservice that handles a stop request in the form of the 
> SIGTERM signal, resp. a JVM shutdown hook. When it arrives, the application closes 
> all opened files (writers), releases all other resources and shuts down 
> gracefully. We are switching from sequence files to Parquet at the moment and 
> using the Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}}, which is the 
> current latest version. We are using 
> {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM.
> h3. Example code
> See archive in attachment.
> - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your 
> Hadoop.
> - Use {{mvn package}} to compile.
> - Copy Hadoop configuration XMLs to {{config}} directory.
> - Update configuration at the top of {{ParquetBrokenShutdown}} class.
> - Execute {{ParquetBrokenShutdown}} class.
> - Send SIGTERM to shutdown the application ({{kill PID}}).
> h3. Initial analysis
> The Parquet library tries to take care of application shutdown, but this 
> introduces more issues than it solves. If the application is writing to a file 
> and the library asynchronously decides to close the underlying writer, data loss 
> will occur. The handle is simply closed and the remaining records can't be 
> written.
> {noformat}
> Writing to HDFS/Parquet failed
> java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, 
> uncompressed_page_size:14, compressed_page_size:34, 
> dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN))
>   at org.apache.parquet.format.Util.write(Util.java:224)
>   at org.apache.parquet.format.Util.writePageHeader(Util.java:61)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: parquet.org.apache.thrift.transport.TTransportException: 
> java.nio.channels.ClosedChannelException
>   at 
> parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>   at 
> parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:194)
>   at 
> 

[jira] [Comment Edited] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging

2016-02-16 Thread Michal Turek (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148323#comment-15148323
 ] 

Michal Turek edited comment on PARQUET-408 at 2/16/16 10:01 AM:


I have just tried to add `log4j.xml` and `log4j2.xml` on classpath but it 
didn't help as expected.

http://coders-kitchen.com/2014/01/29/tip-disable-log4j-2s-shutdown-handler/

{noformat}
<Configuration shutdownHook="disable">
  ...
</Configuration>
{noformat}

Adding {{import org.apache.logging.log4j.Logger;}} introduces a compiler error, 
so I'm pretty sure log4j2 is not the cause.

Can you update the attached sample code and make it functional to demonstrate 
your fix, please?


was (Author: tu...@avast.com):
I have just tried to add `log4j.xml` and `log4j2.xml` on classpath but it 
didn't help as expected.

> Shutdown hook in parquet-avro library corrupts data and disables logging
> 
>
> Key: PARQUET-408
> URL: https://issues.apache.org/jira/browse/PARQUET-408
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1
>Reporter: Michal Turek
> Fix For: 1.9.0
>
> Attachments: parquet-broken-shutdown_2015-12-16.tar.gz
>
>
> Parquet-avro, and probably other Parquet libraries as well, is not well behaved. 
> It registers a shutdown hook that bypasses the application's shutdown sequence, 
> corrupts data written to currently opened Parquet file(s), and disables or 
> reconfigures the slf4j/logback logger so that no further log messages are visible.
> h3. Scope
> Our application is a microservice that handles a stop request in the form of the 
> SIGTERM signal, resp. a JVM shutdown hook. When it arrives, the application closes 
> all opened files (writers), releases all other resources and shuts down 
> gracefully. We are switching from sequence files to Parquet at the moment and 
> using the Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}}, which is the 
> current latest version. We are using 
> {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM.
> h3. Example code
> See archive in attachment.
> - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your 
> Hadoop.
> - Use {{mvn package}} to compile.
> - Copy Hadoop configuration XMLs to {{config}} directory.
> - Update configuration at the top of {{ParquetBrokenShutdown}} class.
> - Execute {{ParquetBrokenShutdown}} class.
> - Send SIGTERM to shutdown the application ({{kill PID}}).
> h3. Initial analysis
> The Parquet library tries to take care of application shutdown, but this 
> introduces more issues than it solves. If the application is writing to a file 
> and the library asynchronously decides to close the underlying writer, data loss 
> will occur. The handle is simply closed and the remaining records can't be 
> written.
> {noformat}
> Writing to HDFS/Parquet failed
> java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, 
> uncompressed_page_size:14, compressed_page_size:34, 
> dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN))
>   at org.apache.parquet.format.Util.write(Util.java:224)
>   at org.apache.parquet.format.Util.writePageHeader(Util.java:61)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: parquet.org.apache.thrift.transport.TTransportException: 
> java.nio.channels.ClosedChannelException
>   at 
> 

[jira] [Commented] (PARQUET-408) Shutdown hook in parquet-avro library corrupts data and disables logging

2016-02-16 Thread Michal Turek (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148323#comment-15148323
 ] 

Michal Turek commented on PARQUET-408:
--

I have just tried to add `log4j.xml` and `log4j2.xml` on classpath but it 
didn't help as expected.

> Shutdown hook in parquet-avro library corrupts data and disables logging
> 
>
> Key: PARQUET-408
> URL: https://issues.apache.org/jira/browse/PARQUET-408
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.8.1
>Reporter: Michal Turek
> Fix For: 1.9.0
>
> Attachments: parquet-broken-shutdown_2015-12-16.tar.gz
>
>
> Parquet-avro, and probably other Parquet libraries as well, is not well behaved. 
> It registers a shutdown hook that bypasses the application's shutdown sequence, 
> corrupts data written to currently opened Parquet file(s), and disables or 
> reconfigures the slf4j/logback logger so that no further log messages are visible.
> h3. Scope
> Our application is a microservice that handles a stop request in the form of the 
> SIGTERM signal, resp. a JVM shutdown hook. When it arrives, the application closes 
> all opened files (writers), releases all other resources and shuts down 
> gracefully. We are switching from sequence files to Parquet at the moment and 
> using the Maven dependency {{org.apache.parquet:parquet-avro:1.8.1}}, which is the 
> current latest version. We are using 
> {{Runtime.getRuntime().addShutdownHook()}} to handle SIGTERM.
> h3. Example code
> See archive in attachment.
> - Optionally update version of {{hadoop-client}} in {{pom.xml}} to match your 
> Hadoop.
> - Use {{mvn package}} to compile.
> - Copy Hadoop configuration XMLs to {{config}} directory.
> - Update configuration at the top of {{ParquetBrokenShutdown}} class.
> - Execute {{ParquetBrokenShutdown}} class.
> - Send SIGTERM to shutdown the application ({{kill PID}}).
> h3. Initial analysis
> The Parquet library tries to take care of application shutdown, but this 
> introduces more issues than it solves. If the application is writing to a file 
> and the library asynchronously decides to close the underlying writer, data loss 
> will occur. The handle is simply closed and the remaining records can't be 
> written.
> {noformat}
> Writing to HDFS/Parquet failed
> java.io.IOException: can not write PageHeader(type:DICTIONARY_PAGE, 
> uncompressed_page_size:14, compressed_page_size:34, 
> dictionary_page_header:DictionaryPageHeader(num_values:1, encoding:PLAIN))
>   at org.apache.parquet.format.Util.write(Util.java:224)
>   at org.apache.parquet.format.Util.writePageHeader(Util.java:61)
>   at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.writeDictionaryPageHeader(ParquetMetadataConverter.java:760)
>   at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeDictionaryPage(ParquetFileWriter.java:307)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:179)
>   at 
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:238)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:165)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
>   at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.writeParquetFile(ParquetBrokenShutdown.java:86)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.run(ParquetBrokenShutdown.java:53)
>   at 
> com.avast.bugreport.parquet.ParquetBrokenShutdown.main(ParquetBrokenShutdown.java:153)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: parquet.org.apache.thrift.transport.TTransportException: 
> java.nio.channels.ClosedChannelException
>   at 
> parquet.org.apache.thrift.transport.TIOStreamTransport.write(TIOStreamTransport.java:147)
>   at 
> parquet.org.apache.thrift.transport.TTransport.write(TTransport.java:105)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:424)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeByteDirect(TCompactProtocol.java:431)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.writeFieldBeginInternal(TCompactProtocol.java:194)
>   at 
>