[jira] [Created] (PARQUET-2074) Upgrade to JDK 9+
David Mollitor created PARQUET-2074: --- Summary: Upgrade to JDK 9+ Key: PARQUET-2074 URL: https://issues.apache.org/jira/browse/PARQUET-2074 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Moving to JDK 9 will provide a plethora of new compare/equals capabilities on arrays that are all vectorized and annotated with {{@IntrinsicCandidate}}: https://docs.oracle.com/javase/9/docs/api/java/util/Arrays.html -- This message was sent by Atlassian Jira (v8.3.4#803005)
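For illustration, a minimal sketch of the JDK 9+ array methods the issue refers to (plain stdlib, not parquet-mr code):

```java
import java.util.Arrays;

public class ArrayCompareDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = {1, 2, 9, 4};

        // Arrays.mismatch: index of the first differing element, -1 if equal.
        System.out.println(Arrays.mismatch(a, b));            // 2
        // Arrays.compare: three-way lexicographic comparison.
        System.out.println(Arrays.compare(a, b) < 0);         // true (3 < 9)
        // Range-based equals compares sub-ranges without copying them out first.
        System.out.println(Arrays.equals(a, 0, 2, b, 0, 2));  // true
    }
}
```

On JDK 9+ these methods are intrinsic candidates (the annotation was named {{@HotSpotIntrinsicCandidate}} before JDK 16), so HotSpot can replace them with vectorized machine code.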
[jira] [Created] (PARQUET-2072) Do Not Determine Both Min/Max for Binary Stats
David Mollitor created PARQUET-2072: --- Summary: Do Not Determine Both Min/Max for Binary Stats Key: PARQUET-2072 URL: https://issues.apache.org/jira/browse/PARQUET-2072 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor I'm looking at some benchmarking code of Apache ORC vs. Apache Parquet and see that Parquet is quite a bit slower for writes (reads TBD). Based on my investigation, I have noticed a significant amount of time spent in determining min/max for binary types. One quick improvement is to bypass the "max" value determination if the value has already been determined to be a "min". While I'm at it, remove calls to deprecated functions. -- This message was sent by Atlassian Jira (v8.3.4#803005)
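The short-circuit the issue describes can be sketched as follows (hypothetical stand-alone class, not the actual parquet-mr statistics code): once a value is known to be a new minimum it cannot also be a new maximum, because min <= max always holds, so the second comparison can be skipped.

```java
import java.util.Comparator;

public class BinaryStats {
    // Hypothetical stand-in for Parquet's unsigned lexicographic binary ordering.
    private static final Comparator<byte[]> UNSIGNED_LEX = (x, y) -> {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(x[i] & 0xFF, y[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(x.length, y.length);
    };

    private byte[] min;
    private byte[] max;

    public void update(byte[] value) {
        if (min == null) {
            min = max = value;  // first value seeds both bounds
        } else if (UNSIGNED_LEX.compare(value, min) < 0) {
            // A new min can never also be a new max (min <= max),
            // so the max comparison is skipped entirely.
            min = value;
        } else if (UNSIGNED_LEX.compare(value, max) > 0) {
            max = value;
        }
    }

    public byte[] getMin() { return min; }
    public byte[] getMax() { return max; }
}
```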
[jira] [Created] (PARQUET-2063) Remove Compile Warnings from MemoryManager
David Mollitor created PARQUET-2063: --- Summary: Remove Compile Warnings from MemoryManager Key: PARQUET-2063 URL: https://issues.apache.org/jira/browse/PARQUET-2063 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2048) Deprecate BaseRecordReader
[ https://issues.apache.org/jira/browse/PARQUET-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-2048: Summary: Deprecate BaseRecordReader (was: Remove BaseRecordReader) > Deprecate BaseRecordReader > -- > > Key: PARQUET-2048 > URL: https://issues.apache.org/jira/browse/PARQUET-2048 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > No longer used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2048) Remove BaseRecordReader
David Mollitor created PARQUET-2048: --- Summary: Remove BaseRecordReader Key: PARQUET-2048 URL: https://issues.apache.org/jira/browse/PARQUET-2048 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor No longer used. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-2047) Clean Up Code
[ https://issues.apache.org/jira/browse/PARQUET-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-2047: Description: * Removed unused code * Remove unused imports * Add @Override annotations Mostly throwing away superfluous stuff. Less is more. was: * Removed unused code * Remove unused imports * Add \@Override annotations > Clean Up Code > - > > Key: PARQUET-2047 > URL: https://issues.apache.org/jira/browse/PARQUET-2047 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > * Removed unused code > * Remove unused imports > * Add @Override annotations > Mostly throwing away superfluous stuff. Less is more. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2047) Clean Up Code
David Mollitor created PARQUET-2047: --- Summary: Clean Up Code Key: PARQUET-2047 URL: https://issues.apache.org/jira/browse/PARQUET-2047 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor * Removed unused code * Remove unused imports * Add \@Override annotations -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-2046) Upgrade Apache POM to 23
David Mollitor created PARQUET-2046: --- Summary: Upgrade Apache POM to 23 Key: PARQUET-2046 URL: https://issues.apache.org/jira/browse/PARQUET-2046 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1666) Remove Unused Modules
[ https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259913#comment-17259913 ] David Mollitor commented on PARQUET-1666: - Shouldn't this be a Parquet-MR 2.0 action? > Remove Unused Modules > -- > > Key: PARQUET-1666 > URL: https://issues.apache.org/jira/browse/PARQUET-1666 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.12.0 >Reporter: Xinli Shang >Priority: Major > Fix For: 1.12.0 > > > In the last two meetings, Ryan Blue proposed to remove some unused Parquet > modules. This is to open a task to track it. > Here are the related meeting notes for the discussion on this. > Remove old Parquet modules > Hive modules - sounds good > Scrooge - Julien will reach out to Twitter > Tools - undecided - Cloudera may still use the parquet-tools according to > Gabor. > Cascading - undecided > We can mark the modules as deprecated in the description. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1126) make it easy to read and write parquet files in java without depending on hadoop
[ https://issues.apache.org/jira/browse/PARQUET-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249875#comment-17249875 ] David Mollitor commented on PARQUET-1126: - Also check out some related work already done (waiting in a GitHub PR): [PARQUET-1776] > make it easy to read and write parquet files in java without depending on > hadoop > > > Key: PARQUET-1126 > URL: https://issues.apache.org/jira/browse/PARQUET-1126 > Project: Parquet > Issue Type: Improvement >Reporter: Oscar Boykin >Priority: Major > > I am happy to help with this but I'd love some guidance on: > 1) likelihood of being accepted as a patch. > 2) how critical it is to maintain backwards compatibility in APIs. > For instance, we probably want to introduce a new artifact that lives under > the existing hadoop depending artifact, and move as much code as possible to > that, keeping the hadoop apis in the old artifact. > Welcome comments on solving this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1925) Introduce Velocity Template Engine to Parquet Generator
David Mollitor created PARQUET-1925: --- Summary: Introduce Velocity Template Engine to Parquet Generator Key: PARQUET-1925 URL: https://issues.apache.org/jira/browse/PARQUET-1925 Project: Parquet Issue Type: New Feature Reporter: David Mollitor Assignee: David Mollitor Much easier than the current setup of manually outputting the strings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1924) Do not Instantiate a New LongHashFunction
[ https://issues.apache.org/jira/browse/PARQUET-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1924: Description: {code:java|title=XxHash.java} /** * The implementation of HashFunction interface. The XxHash uses XXH64 version xxHash * with a seed of 0. */ public class XxHash implements HashFunction { @Override public long hashBytes(byte[] input) { return LongHashFunction.xx(0).hashBytes(input); } @Override public long hashByteBuffer(ByteBuffer input) { return LongHashFunction.xx(0).hashBytes(input); } } {code} Since the seed is always zero, the {{static}} implementation provided by the library can be used here. was: {code:java|title=XxHash.java} /** * The implementation of HashFunction interface. The XxHash uses XXH64 version xxHash * with a seed of 0. */ public class XxHash implements HashFunction { @Override public long hashBytes(byte[] input) { return LongHashFunction.xx(0).hashBytes(input); } @Override public long hashByteBuffer(ByteBuffer input) { return LongHashFunction.xx(0).hashBytes(input); } } {code} > Do not Instantiate a New LongHashFunction > -- > > Key: PARQUET-1924 > URL: https://issues.apache.org/jira/browse/PARQUET-1924 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > {code:java|title=XxHash.java} > /** > * The implementation of HashFunction interface. The XxHash uses XXH64 > version xxHash > * with a seed of 0. > */ > public class XxHash implements HashFunction { > @Override > public long hashBytes(byte[] input) { > return LongHashFunction.xx(0).hashBytes(input); > } > @Override > public long hashByteBuffer(ByteBuffer input) { > return LongHashFunction.xx(0).hashBytes(input); > } > {code} > Since the seed is always zero, the {{static}} implementation provided by the > library can be used here. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1924) Do not Instantiate a New LongHashFunction
David Mollitor created PARQUET-1924: --- Summary: Do not Instantiate a New LongHashFunction Key: PARQUET-1924 URL: https://issues.apache.org/jira/browse/PARQUET-1924 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor {code:java|title=XxHash.java} /** * The implementation of HashFunction interface. The XxHash uses XXH64 version xxHash * with a seed of 0. */ public class XxHash implements HashFunction { @Override public long hashBytes(byte[] input) { return LongHashFunction.xx(0).hashBytes(input); } @Override public long hashByteBuffer(ByteBuffer input) { return LongHashFunction.xx(0).hashBytes(input); } } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
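A sketch of the fix (assuming the zero-allocation-hashing library's no-argument {{LongHashFunction.xx()}}, which returns a shared xxHash64 instance with the default seed of 0): cache one stateless, thread-safe instance instead of allocating per call.

```java
import java.nio.ByteBuffer;
import net.openhft.hashing.LongHashFunction;

public class XxHash implements HashFunction {
    // LongHashFunction instances are stateless and thread-safe, so one shared
    // instance (seed 0) replaces the per-call LongHashFunction.xx(0) allocation.
    private static final LongHashFunction XXH64 = LongHashFunction.xx();

    @Override
    public long hashBytes(byte[] input) {
        return XXH64.hashBytes(input);
    }

    @Override
    public long hashByteBuffer(ByteBuffer input) {
        return XXH64.hashBytes(input);
    }
}
```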
[jira] [Created] (PARQUET-1922) Deprecate IOExceptionUtils
David Mollitor created PARQUET-1922: --- Summary: Deprecate IOExceptionUtils Key: PARQUET-1922 URL: https://issues.apache.org/jira/browse/PARQUET-1922 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1921) Use StringBuilder instead of StringBuffer
David Mollitor created PARQUET-1921: --- Summary: Use StringBuilder instead of StringBuffer Key: PARQUET-1921 URL: https://issues.apache.org/jira/browse/PARQUET-1921 Project: Parquet Issue Type: Improvement Components: parquet-mr Reporter: David Mollitor {code:java|title=MessageTypeParser.java} private StringBuffer currentLine = new StringBuffer(); public String nextToken() { while (st.hasMoreTokens()) { String t = st.nextToken(); if (t.equals("\n")) { ++ line; currentLine.setLength(0); } else { currentLine.append(t); } if (!isWhitespace(t)) { return t; } } throw new IllegalArgumentException("unexpected end of schema"); } {code} Use {{StringBuilder}} instead of {{StringBuffer}} as {{StringBuffer}} is synchronized (which is not required here). -- This message was sent by Atlassian Jira (v8.3.4#803005)
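The change is a drop-in rename; a stand-alone sketch of the pattern (hypothetical class, since the real fix just swaps the field type inside {{MessageTypeParser}}):

```java
public class TokenLineTracker {
    // StringBuilder offers the same API as StringBuffer without per-call
    // synchronization, which a single-threaded parser never needs.
    private final StringBuilder currentLine = new StringBuilder();

    public void append(String token) {
        if (token.equals("\n")) {
            currentLine.setLength(0);   // reset the line buffer, as the parser does
        } else {
            currentLine.append(token);
        }
    }

    public String line() {
        return currentLine.toString();
    }
}
```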
[jira] [Comment Edited] (PARQUET-1918) Avoid Copy of Bytes in Protobuf BinaryWriter
[ https://issues.apache.org/jira/browse/PARQUET-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206414#comment-17206414 ] David Mollitor edited comment on PARQUET-1918 at 10/2/20, 7:52 PM: --- Unit tests fail with: Trying to address with THRIFT-5288 {code:java} java.lang.Exception: java.nio.ReadOnlyBufferException at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) Caused by: java.nio.ReadOnlyBufferException at java.nio.ByteBuffer.array(ByteBuffer.java:996) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeBinary(TCompactProtocol.java:375) at org.apache.parquet.format.InterningProtocol.writeBinary(InterningProtocol.java:135) at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:945) at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:820) at org.apache.parquet.format.ColumnIndex.write(ColumnIndex.java:728) at org.apache.parquet.format.Util.write(Util.java:372) at org.apache.parquet.format.Util.writeColumnIndex(Util.java:69) at org.apache.parquet.hadoop.ParquetFileWriter.serializeColumnIndexes(ParquetFileWriter.java:1087) at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1050) {code} was (Author: belugabehr): Unit tests fail with: {code:java} java.lang.Exception: java.nio.ReadOnlyBufferException at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) Caused by: java.nio.ReadOnlyBufferException at java.nio.ByteBuffer.array(ByteBuffer.java:996) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeBinary(TCompactProtocol.java:375) at org.apache.parquet.format.InterningProtocol.writeBinary(InterningProtocol.java:135) at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:945) at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:820) at org.apache.parquet.format.ColumnIndex.write(ColumnIndex.java:728) at org.apache.parquet.format.Util.write(Util.java:372) at org.apache.parquet.format.Util.writeColumnIndex(Util.java:69) at org.apache.parquet.hadoop.ParquetFileWriter.serializeColumnIndexes(ParquetFileWriter.java:1087) at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1050) {code} > Avoid Copy of Bytes in Protobuf BinaryWriter > > > Key: PARQUET-1918 > URL: https://issues.apache.org/jira/browse/PARQUET-1918 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > {code:java|title=ProtoWriteSupport.java} > class BinaryWriter extends FieldWriter { > @Override > final void writeRawValue(Object value) { > ByteString byteString = (ByteString) value; > Binary binary = Binary.fromConstantByteArray(byteString.toByteArray()); > recordConsumer.addBinary(binary); > } > } > {code} > {{toByteArray()}} creates a copy of the buffer. There is already support > with Parquet and Protobuf to pass instead a ByteBuffer which avoids the copy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1918) Avoid Copy of Bytes in Protobuf BinaryWriter
[ https://issues.apache.org/jira/browse/PARQUET-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206414#comment-17206414 ] David Mollitor commented on PARQUET-1918: - Unit tests fail with: {code:java} java.lang.Exception: java.nio.ReadOnlyBufferException at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522) Caused by: java.nio.ReadOnlyBufferException at java.nio.ByteBuffer.array(ByteBuffer.java:996) at shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeBinary(TCompactProtocol.java:375) at org.apache.parquet.format.InterningProtocol.writeBinary(InterningProtocol.java:135) at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:945) at org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:820) at org.apache.parquet.format.ColumnIndex.write(ColumnIndex.java:728) at org.apache.parquet.format.Util.write(Util.java:372) at org.apache.parquet.format.Util.writeColumnIndex(Util.java:69) at org.apache.parquet.hadoop.ParquetFileWriter.serializeColumnIndexes(ParquetFileWriter.java:1087) at org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1050) {code} > Avoid Copy of Bytes in Protobuf BinaryWriter > > > Key: PARQUET-1918 > URL: https://issues.apache.org/jira/browse/PARQUET-1918 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > {code:java|title=ProtoWriteSupport.java} > class BinaryWriter extends FieldWriter { > @Override > final void writeRawValue(Object value) { > ByteString byteString = (ByteString) value; > Binary binary = Binary.fromConstantByteArray(byteString.toByteArray()); > recordConsumer.addBinary(binary); > } > } > {code} > {{toByteArray()}} creates a copy of the buffer. 
There is already support > with Parquet and Protobuf to pass instead a ByteBuffer which avoids the copy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Moved] (PARQUET-1918) Avoid Copy of Bytes in Protobuf BinaryWriter
[ https://issues.apache.org/jira/browse/PARQUET-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor moved HIVE-24226 to PARQUET-1918: Key: PARQUET-1918 (was: HIVE-24226) Workflow: patch-available, re-open possible (was: no-reopen-closed, patch-avail) Project: Parquet (was: Hive) > Avoid Copy of Bytes in Protobuf BinaryWriter > > > Key: PARQUET-1918 > URL: https://issues.apache.org/jira/browse/PARQUET-1918 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > {code:java|title=ProtoWriteSupport.java} > class BinaryWriter extends FieldWriter { > @Override > final void writeRawValue(Object value) { > ByteString byteString = (ByteString) value; > Binary binary = Binary.fromConstantByteArray(byteString.toByteArray()); > recordConsumer.addBinary(binary); > } > } > {code} > {{toByteArray()}} creates a copy of the buffer. There is already support > with Parquet and Protobuf to pass instead a ByteBuffer which avoids the copy. -- This message was sent by Atlassian Jira (v8.3.4#803005)
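A sketch of the intended change to {{ProtoWriteSupport}} (assuming parquet's {{Binary.fromConstantByteBuffer}} and protobuf's {{ByteString.asReadOnlyByteBuffer()}}, which wraps rather than copies the backing bytes):

```java
import com.google.protobuf.ByteString;
import org.apache.parquet.io.api.Binary;

class BinaryWriter extends FieldWriter {
    @Override
    final void writeRawValue(Object value) {
        ByteString byteString = (ByteString) value;
        // asReadOnlyByteBuffer() exposes the existing backing bytes without a
        // copy, and Binary.fromConstantByteBuffer keeps the reference as-is.
        // Trade-off: downstream code that calls ByteBuffer.array() fails with
        // ReadOnlyBufferException on read-only buffers, which is the failure
        // shown in the unit-test stack trace on this issue.
        Binary binary = Binary.fromConstantByteBuffer(byteString.asReadOnlyByteBuffer());
        recordConsumer.addBinary(binary);
    }
}
```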
[jira] [Commented] (PARQUET-1914) Allow ProtoParquetReader To Support InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199592#comment-17199592 ] David Mollitor commented on PARQUET-1914: - {{ProtoParquetReader.Builder}} should extend {{ParquetReader.Builder}} to correctly override {{getReadSupport()}} and to allow for using an {{InputFile}} in addition to the previously supported {{Path}}. The usage pattern here is a bit confusing and I wanted to update the {{ParquetReader.Builder}} directly, but I think this is the way it is intended. > Allow ProtoParquetReader To Support InputFile > - > > Key: PARQUET-1914 > URL: https://issues.apache.org/jira/browse/PARQUET-1914 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (PARQUET-1913) ParquetReader Should Support InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor resolved PARQUET-1913. - Resolution: Won't Fix > ParquetReader Should Support InputFile > -- > > Key: PARQUET-1913 > URL: https://issues.apache.org/jira/browse/PARQUET-1913 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Major > > When creating a {{ParquetReader}}, a "read support" object is required. > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L325-L330 > However, when building from an {{InputFile}}, 'readSupport' is always 'null' > and therefore will never work. > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L202 > Add the read support option just as is done with a {{Path}} object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1914) Allow ProtoParquetReader To Support InputFile
David Mollitor created PARQUET-1914: --- Summary: Allow ProtoParquetReader To Support InputFile Key: PARQUET-1914 URL: https://issues.apache.org/jira/browse/PARQUET-1914 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1913) ParquetReader Should Support InputFile
David Mollitor created PARQUET-1913: --- Summary: ParquetReader Should Support InputFile Key: PARQUET-1913 URL: https://issues.apache.org/jira/browse/PARQUET-1913 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor When creating a {{ParquetReader}}, a "read support" object is required. https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L325-L330 However, when building from an {{InputFile}}, 'readSupport' is always 'null' and therefore will never work. https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L202 Add the read support option just as is done with a {{Path}} object. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies
[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188573#comment-17188573 ] David Mollitor commented on PARQUET-1822: - Parquet 2.0 anyone? > Parquet without Hadoop dependencies > --- > > Key: PARQUET-1822 > URL: https://issues.apache.org/jira/browse/PARQUET-1822 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.11.0 > Environment: Amazon Fargate (linux), Windows development box. > We are writing Parquet to be read by the Snowflake and Athena databases. >Reporter: mark juchems >Priority: Minor > Labels: documentation, newbie > > I have been trying for weeks to create a parquet file from avro and write to > S3 in Java. This has been incredibly frustrating and odd as Spark can do it > easily (I'm told). > I have assembled the correct jars through luck and diligence, but now I find > out that I have to have hadoop installed on my machine. I am currently > developing in Windows and it seems a dll and exe can fix that up but am > wondering about Linux as the code will eventually run in Fargate on AWS. > *Why do I need external dependencies and not pure java?* > The thing really is how utterly complex all this is. I would like to create > an avro file and convert it to Parquet and write it to S3, but I am trapped > in "ParquetWriter" hell! > *Why can't I get a normal OutputStream and write it wherever I want?* > I have scoured the web for examples and there are a few but we really need > some documentation on this stuff. I understand that there may be reasons for > all this but I can't find them on the web anywhere. Any help? Can't we get > the "SimpleParquet" jar that does this: > > ParquetWriter writer = > AvroParquetWriter.builder(outputStream) > .withSchema(avroSchema) > .withConf(conf) > .withCompressionCodec(CompressionCodecName.SNAPPY) > .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites > files). 
> .build(); > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1905) Use SeekableByteChannel instead of OutputFile/InputFile Classes
[ https://issues.apache.org/jira/browse/PARQUET-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186108#comment-17186108 ] David Mollitor commented on PARQUET-1905: - Also gets rid of {{PositionOutputStream}} > Use SeekableByteChannel instead of OutputFile/InputFile Classes > --- > > Key: PARQUET-1905 > URL: https://issues.apache.org/jira/browse/PARQUET-1905 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > Fix For: 2.0.0 > > > Use Java NIO SeekableByteChannel for input to reader/writer instead of the > current Parquet-only {{Output}}/{{InputFile}} Classes -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1905) Use SeekableByteChannel instead of OutputFile/InputFile Classes
David Mollitor created PARQUET-1905: --- Summary: Use SeekableByteChannel instead of OutputFile/InputFile Classes Key: PARQUET-1905 URL: https://issues.apache.org/jira/browse/PARQUET-1905 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Fix For: 2.0.0 Use Java NIO SeekableByteChannel instead of the current Parquet-only {{Output}}/{{InputFile}} Classes -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1905) Use SeekableByteChannel instead of OutputFile/InputFile Classes
[ https://issues.apache.org/jira/browse/PARQUET-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1905: Description: Use Java NIO SeekableByteChannel for input to reader/writer instead of the current Parquet-only {{Output}}/{{InputFile}} Classes (was: Use Java NIO SeekableByteChannel instead of the current Parquet-only {{Output}}/{{InputFile}} Classes) > Use SeekableByteChannel instead of OutputFile/InputFile Classes > --- > > Key: PARQUET-1905 > URL: https://issues.apache.org/jira/browse/PARQUET-1905 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > Fix For: 2.0.0 > > > Use Java NIO SeekableByteChannel for input to reader/writer instead of the > current Parquet-only {{Output}}/{{InputFile}} Classes -- This message was sent by Atlassian Jira (v8.3.4#803005)
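The JDK interface already covers the seek-then-read contract that {{InputFile}}/{{SeekableInputStream}} provide; a minimal stdlib sketch of a positioned read over a {{SeekableByteChannel}}:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

public class ChannelRead {
    // Read exactly `len` bytes starting at absolute offset `pos` -- the core
    // operation Parquet needs from an input: seek plus a fully-filled read.
    public static byte[] readAt(SeekableByteChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        ch.position(pos);
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                throw new IOException("EOF before " + len + " bytes at offset " + pos);
            }
        }
        return buf.array();
    }
}
```

Writers are covered symmetrically, since {{SeekableByteChannel}} also supports {{write}} and {{position()}} for tracking the current offset, which is what {{PositionOutputStream}} exists for today.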
[jira] [Updated] (PARQUET-1903) Improve Parquet Protobuf Usability
[ https://issues.apache.org/jira/browse/PARQUET-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1903: Description: Check out the PR for details. * Move away from passing around a {{Class}} object to take advantage of Java Templating * Make parquet-proto library more usable and straight-forward * Provide test examples * Limited support for protocol buffer schema registry was: Check out the PR for details. * Move away from passing around a {{Class}} object to take advantage of Java Templating * Make parquet-proto library more usable and straight-forward * Provide test examples > Improve Parquet Protobuf Usability > -- > > Key: PARQUET-1903 > URL: https://issues.apache.org/jira/browse/PARQUET-1903 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Major > > Check out the PR for details. > > * Move away from passing around a {{Class}} object to take advantage of Java > Templating > * Make parquet-proto library more usable and straight-forward > * Provide test examples > * Limited support for protocol buffer schema registry > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1903) Improve Parquet Protobuf Usability
David Mollitor created PARQUET-1903: --- Summary: Improve Parquet Protobuf Usability Key: PARQUET-1903 URL: https://issues.apache.org/jira/browse/PARQUET-1903 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor Check out the PR for details. * Move away from passing around a {{Class}} object to take advantage of Java Templating * Make parquet-proto library more usable and straight-forward * Provide test examples -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Finding Max Value of Column
Hey Gabor, I appreciate you sharing your knowledge with me. As I understand it, my solution is acceptable but is not the generalized solution. What would that solution look like? Thanks. On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky wrote: > Hi, > > Statistics objects are mainly created for internal use. The check you > mentioned is to ensure that only the corresponding column statistics are > summarized. > The code you've written works properly because you create and use the > Statistics object as we use it internally. However, it is quite easy to > misuse it. > It is also worth mentioning that the code works properly because your type > is an INT64. In case of some other types (e.g. FLOAT, DOUBLE, BINARY) it > would not always be that trivial. > So, if this code works for your case you may use it but I would not suggest > generalizing it for other cases and neither would suggest extending the > existing code to support it. > > Regards, > Gabor > > On Mon, Mar 9, 2020 at 4:12 PM David Mollitor wrote: > > > Hello, > > > > One thing that would have made this even easier... the 'mergeStatistics' > > method throws an exception if the columns are not equal on the RHS/LHS of > > the method. I had to add that toDotString check to avoid this > scenario. I > > could have just caught (and ignored) that exception to remove that extra > > check, but the overhead would have been heavy, and it would have added > even > > more code. > > > > The 'mergeStatistics' method is already doing a comparison check > internally > > (that's why it throws an exception), is there any interest in adding a > new > > method signature that returns true/false if the merge was successful, > > instead of throwing an exception? 
> > > > Then the code just becomes: > > > > for (final BlockMetaData rowGroup : reader.getRowGroups()) { > > for (final ColumnChunkMetaData column : rowGroup.getColumns()) { > > boolean success = > > stats.mergeStatistics(column.getStatistics()); > > } > > } > > > > > > > > On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky > > wrote: > > > > > Hi David, > > > > > > Your code looks good to me. As you are using INT64, min/max truncate > does > > > not apply. I think, it should work fine. > > > > > > Cheers, > > > Gabor > > > > > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor > wrote: > > > > > > > Hello Gang, > > > > > > > > I am trying to build an application. One function it has is to scan > a > > > > directory of Parquet files and then determine the maximum "sequence > > > number" > > > > (id) across all files. This is the solution I came up with, but is > > this > > > > correct? How would you do such a thing? > > > > > > > > I wrote the files with parquet-avro writer. > > > > > > > > try (DirectoryStream directoryStream = > > > > Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) { > > > > > > > > PrimitiveType type = > > > > Types.required(PrimitiveTypeName.INT64).named("seq"); > > > > Statistics stats = > Statistics.getBuilderForReading(type).build(); > > > > > > > > for (java.nio.file.Path path : directoryStream) { > > > > ParquetFileReader reader = > > > > ParquetFileReader.open(HadoopInputFile.fromPath(new > Path(path.toUri()), > > > new > > > > Configuration())); > > > > > > > > for (final BlockMetaData rowGroup : reader.getRowGroups()) { > > > > for (final ColumnChunkMetaData column : rowGroup.getColumns()) > { > > > > if ("seq".equals(column.getPath().toDotString())) { > > > > stats.mergeStatistics(column.getStatistics()); > > > > } > > > > } > > > >} > > > > } > > > > > > > > Thanks. > > > > > > > > > >
Re: Finding Max Value of Column
Hello, One thing that would have made this even easier... the 'mergeStatistics' method throws an exception if the columns are not equal on the RHS/LHS of the method. I had to add that toDotString check to avoid this scenario. I could have just caught (and ignored) that exception to remove that extra check, but the overhead would have been heavy, and it would have added even more code. The 'mergeStatistics' method is already doing a comparison check internally (that's why it throws an exception), is there any interest in adding a new method signature that returns true/false if the merge was successful, instead of throwing an exception? Then the code just becomes: for (final BlockMetaData rowGroup : reader.getRowGroups()) { for (final ColumnChunkMetaData column : rowGroup.getColumns()) { boolean success = stats.mergeStatistics(column.getStatistics()); } } On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky wrote: > Hi David, > > Your code looks good to me. As you are using INT64, min/max truncate does > not apply. I think, it should work fine. > > Cheers, > Gabor > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor wrote: > > > Hello Gang, > > > > I am trying to build an application. One function it has is to scan a > > directory of Parquet files and then determine the maximum "sequence > number" > > (id) across all files. This is the solution I came up with, but is this > > correct? How would you do such a thing? > > > > I wrote the files with parquet-avro writer. 
> > > > try (DirectoryStream directoryStream = > > Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) { > > > > PrimitiveType type = > > Types.required(PrimitiveTypeName.INT64).named("seq"); > > Statistics stats = Statistics.getBuilderForReading(type).build(); > > > > for (java.nio.file.Path path : directoryStream) { > > ParquetFileReader reader = > > ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path.toUri()), > new > > Configuration())); > > > > for (final BlockMetaData rowGroup : reader.getRowGroups()) { > > for (final ColumnChunkMetaData column : rowGroup.getColumns()) { > > if ("seq".equals(column.getPath().toDotString())) { > > stats.mergeStatistics(column.getStatistics()); > > } > > } > >} > > } > > > > Thanks. > > >
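A minimal, self-contained sketch of the non-throwing merge semantics proposed above (all names hypothetical; this is not an actual parquet-mr API, and the column check stands in for the comparison mergeStatistics performs internally):

```java
/** Hypothetical sketch of a boolean-returning mergeStatistics variant. */
public class LenientMerge {

    /**
     * Merges {other} min/max into {acc} when the column names match.
     * Returns false on a mismatch instead of throwing, so the caller
     * can skip foreign columns without the cost of exception handling.
     */
    static boolean tryMerge(long[] acc, String accColumn, long[] other, String otherColumn) {
        if (!accColumn.equals(otherColumn)) {
            return false; // mismatched columns: skip quietly
        }
        acc[0] = Math.min(acc[0], other[0]); // running min
        acc[1] = Math.max(acc[1], other[1]); // running max
        return true;
    }

    public static void main(String[] args) {
        long[] seq = { Long.MAX_VALUE, Long.MIN_VALUE }; // empty accumulator
        boolean merged  = tryMerge(seq, "seq", new long[] { 3, 17 }, "seq");   // true
        boolean skipped = tryMerge(seq, "seq", new long[] { 0, 99 }, "other"); // false
        System.out.println(merged + " " + skipped + " " + seq[0] + " " + seq[1]);
    }
}
```

With this shape the caller's loop body reduces to a single `tryMerge` call and the toDotString guard becomes optional.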
Finding Max Value of Column
Hello Gang,

I am trying to build an application. One function it has is to scan a directory of Parquet files and then determine the maximum "sequence number" (id) across all files. This is the solution I came up with, but is this correct? How would you do such a thing?

I wrote the files with the parquet-avro writer.

try (DirectoryStream<java.nio.file.Path> directoryStream =
    Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {

  PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("seq");
  Statistics<?> stats = Statistics.getBuilderForReading(type).build();

  for (java.nio.file.Path path : directoryStream) {
    ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(path.toUri()), new Configuration()));

    for (final BlockMetaData rowGroup : reader.getRowGroups()) {
      for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
        if ("seq".equals(column.getPath().toDotString())) {
          stats.mergeStatistics(column.getStatistics());
        }
      }
    }
  }
}

Thanks.
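The directory-scan-and-fold pattern in the code above can be sketched in isolation, with the Parquet footer read stubbed out (all names here are hypothetical; `maxSeqOf` stands in for reading a column chunk's max statistic):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch: visit every file in a directory, fold a running maximum. */
public class MaxSeqScan {

    // Stand-in for extracting the max "seq" from a Parquet file's
    // footer statistics; here each file just holds a number as text.
    static long maxSeqOf(Path file) throws IOException {
        return Long.parseLong(Files.readString(file).trim());
    }

    /** Returns the maximum "seq" observed across all files in {dir}. */
    public static long scan(Path dir) throws IOException {
        long max = Long.MIN_VALUE;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                max = Math.max(max, maxSeqOf(f));
            }
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("parq-files");
        Files.writeString(dir.resolve("a"), "17");
        Files.writeString(dir.resolve("b"), "42");
        Files.writeString(dir.resolve("c"), "5");
        System.out.println(scan(dir)); // prints 42
    }
}
```

In the real version, `maxSeqOf` would open the file with ParquetFileReader and merge the column-chunk statistics, as in the message above; the fold itself is unchanged.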
[jira] [Created] (PARQUET-1782) Use Switch Statement in AvroRecordConverter
David Mollitor created PARQUET-1782: --- Summary: Use Switch Statement in AvroRecordConverter Key: PARQUET-1782 URL: https://issues.apache.org/jira/browse/PARQUET-1782 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1778: Comment: was deleted (was: I think this is an Avro issue.) > Do Not Consider Class for Avro Generic Record Reader > > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... > {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) > ~[avro-1.9.1.jar:1.9.1] > at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) > ~[na:1.8.0_232] > at java.lang.ClassValue.getFromBackup(ClassValue.java:209) > ~[na:1.8.0_232] > at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) > ~[parquet-avro-1.11.0.jar:1.11.0] > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392) > ~[parquet-column-1.11.0.jar:1.11.0] > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > at 
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > {code} > I was surprised because it should just load a {{GenericRecord}} view of the > data. But alas, I have the Avro Schema defined with the {{namespace}} and > {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so > happens to be a real class on the class path, so it is trying to call the > public constructor on the class and this constructor does does not exist. > Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema > namespace information. > I am putting {{GenericRecords}} into the Parquet file, I expect to get > {{GenericRecords}} back out when I read it. > If I hack the information in a Schema and change the {{namespace}} or > {{name}} fields to something bogus, it works as I would expect it to. It > successfully reads and returns a {{GenericRecord}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor reassigned PARQUET-1778: --- Assignee: David Mollitor > Do Not Consider Class for Avro Generic Record Reader > > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... > {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) > ~[avro-1.9.1.jar:1.9.1] > at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) > ~[na:1.8.0_232] > at java.lang.ClassValue.getFromBackup(ClassValue.java:209) > ~[na:1.8.0_232] > at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) > ~[parquet-avro-1.11.0.jar:1.11.0] > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392) > ~[parquet-column-1.11.0.jar:1.11.0] > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > at 
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > {code} > I was surprised because it should just load a {{GenericRecord}} view of the > data. But alas, I have the Avro Schema defined with the {{namespace}} and > {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so > happens to be a real class on the class path, so it is trying to call the > public constructor on the class and this constructor does does not exist. > Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema > namespace information. > I am putting {{GenericRecords}} into the Parquet file, I expect to get > {{GenericRecords}} back out when I read it. > If I hack the information in a Schema and change the {{namespace}} or > {{name}} fields to something bogus, it works as I would expect it to. It > successfully reads and returns a {{GenericRecord}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028170#comment-17028170 ] David Mollitor commented on PARQUET-1778: - I think this is an Avro issue. > Do Not Consider Class for Avro Generic Record Reader > > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... > {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) > ~[avro-1.9.1.jar:1.9.1] > at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) > ~[na:1.8.0_232] > at java.lang.ClassValue.getFromBackup(ClassValue.java:209) > ~[na:1.8.0_232] > at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) > ~[parquet-avro-1.11.0.jar:1.11.0] > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392) > ~[parquet-column-1.11.0.jar:1.11.0] > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > at 
org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) > ~[parquet-hadoop-1.11.0.jar:1.11.0] > {code} > I was surprised because it should just load a {{GenericRecord}} view of the > data. But alas, I have the Avro Schema defined with the {{namespace}} and > {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so > happens to be a real class on the class path, so it is trying to call the > public constructor on the class and this constructor does does not exist. > Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema > namespace information. > I am putting {{GenericRecords}} into the Parquet file, I expect to get > {{GenericRecords}} back out when I read it. > If I hack the information in a Schema and change the {{namespace}} or > {{name}} fields to something bogus, it works as I would expect it to. It > successfully reads and returns a {{GenericRecord}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Parquet Verbose Logging
Hey Ryan,

I think you understand my position correctly and articulated it well. My background is from higher up the stack; a consumer of these libraries.

We may need to agree to disagree on this one. Projects these days include 100+ libraries and I don't want to have to set a custom log level for each one. It is much easier for a consumer of libraries to keep everything as quiet as possible and then only have to worry about a custom logging level when something goes wrong. Parquet in particular logs a lot of stuff at INFO level that is very specific to Parquet and would only be useful (if at all) to someone who really knows the library, not something that would be helpful to the higher-level application developer.

Thanks.

On Fri, Jan 24, 2020 at 6:48 PM Ryan Blue wrote:

> It sounds like we see logging differently. My approach is that for any
> library, the type of information should be categorized using the same
> criteria into log levels. For example, if it is a normal event you might
> want to know about, use info. It looks like your approach is that the
> levels should be set for information from the perspective of the end
> application: is this behavior relevant to the end user?
>
> The problem is that you don't always know whether something is relevant to
> the end user because that context depends on the application. For the
> Parquet CLI, much more Parquet information is relevant than for Presto that
> is scanning Parquet files. That's why I think it's best to categorize the
> log information using a standard definition, and rely on the end
> application to configure log levels for its users' expectations.
>
> On Fri, Jan 24, 2020 at 10:29 AM David Mollitor wrote:
>
>> Hello Ryan,
>>
>> I appreciate you taking the time to share your thoughts.
>>
>> I'd just like to point out that there is also TRACE level logging if
>> Parquet requires greater granularity.
>>
>> Furthermore, I'm not suggesting that there be an unbreakable rule that
>> all logging must be DEBUG, but it should be the exception, not the rule.
>> It is more likely that the wrapping application would be responsible for
>> logging at the INFO and WARN/ERROR level. Something like:
>>
>> try {
>>   LOG.info("Using Parquet to read file {}", path);
>>   avroParquetReader.read();
>> } catch (Exception e) {
>>   LOG.error("Failed to read Parquet file", e);
>> }
>>
>> This is a very normal setup and doesn't require any additional logging
>> from the Parquet library itself. Once I see an error with "Failed to read
>> Parquet file", then I'm going to turn on DEBUG logging and try to reproduce
>> the error.
>>
>> Thanks,
>> David
>>
>> On Fri, Jan 24, 2020 at 12:01 PM Ryan Blue wrote:
>>
>>> I don't agree with the idea to convert all of Parquet's logs to DEBUG
>>> level, but I do think that we can improve the levels of individual
>>> messages.
>>>
>>> If we convert all logs to debug, then turning on logs to see what Parquet
>>> is doing would show everything from opening an input file to position
>>> tracking in output files. That's way too much information, which is why we
>>> use different log levels to begin with.
>>>
>>> I think we should continue using log levels to distinguish between types of
>>> information: error for errors, warn for recoverable errors that may or may
>>> not indicate a problem, info for regular operations, and debug for extra
>>> information if you're debugging the Parquet library. Following the common
>>> convention enables people to choose what information they want instead of
>>> mixing it all together.
>>>
>>> If you want to only see error and warning logs from Parquet, then the right
>>> way to do that is to configure your logger so that the level for
>>> org.apache.parquet classes is warn. That's not to say I don't agree that we
>>> can cut down on what is logged at info and clean it up; I just don't think
>>> it's a good idea to abandon the idea of log levels to distinguish between
>>> different information the user of a library will need.
>>>
>>> On Fri, Jan 24, 2020 at 6:30 AM lukas nalezenec wrote:
>>>
>>> > Hi,
>>> > I can help too.
>>> > Lukas
>>> >
>>> > On Fri, Jan 24, 2020 at 15:29, David Mollitor wrote:
>>> >
>>> > > Hello Team,
>>> > >
>>> > > I am happy to do the w
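For reference, Ryan's suggestion (quiet Parquet at the logger-configuration level rather than inside the library) would look something like the following in a Log4j 1.x properties file; this is an illustrative config fragment, and the exact syntax depends on which SLF4J backend the application uses:

```properties
# Keep Parquet internals quiet; only recoverable problems and errors surface.
log4j.logger.org.apache.parquet=WARN
```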
[jira] [Updated] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1778: Description:
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
    at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
    at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1]
    at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232]
    at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232]
    at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) ~[parquet-avro-1.11.0.jar:1.11.0]
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392) ~[parquet-column-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) ~[parquet-hadoop-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) ~[parquet-hadoop-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) ~[parquet-hadoop-1.11.0.jar:1.11.0]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}}, which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class, and this constructor does not exist. Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file; I expect to get {{GenericRecords}} back out when I read it. If I hack the information in a Schema and change the {{namespace}} or {{name}} fields to something bogus, it works as I would expect it to. It successfully reads and returns a {{GenericRecord}}.

was:
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
    at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
    at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1]
    at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232]
    at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232]
    at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) ~[parquet-avro-1.11.0.jar:1.11.0]
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392) ~[parquet-column-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) ~[parquet-hadoop-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) ~[parquet-hadoop-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) ~[parquet-hadoop-1.11.0.jar:1.11.0]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}}, which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class, and this constructor does not exist. Regardless, the {{GenericRecordReader}} should just
[jira] [Updated] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1778: Description:
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
    at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
    at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1]
    at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232]
    at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232]
    at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) ~[parquet-avro-1.11.0.jar:1.11.0]
    at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392) ~[parquet-column-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226) ~[parquet-hadoop-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) ~[parquet-hadoop-1.11.0.jar:1.11.0]
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) ~[parquet-hadoop-1.11.0.jar:1.11.0]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}}, which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class, and this constructor does not exist. Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file; I expect to get {{GenericRecords}} back out when I read it.

was:
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
    at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
    at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1]
    at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1]
    at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232]
    at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232]
    at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}}, which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class, and this constructor does not exist. Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file; I expect to get {{GenericRecords}} back out when I read it.
> Do Not Consider Class for Avro Generic Record Reader > > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... > {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.Spe
[jira] [Updated] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1778: Summary: Do Not Consider Class for Avro Generic Record Reader (was: Do Not Record Class for Avro Generic Record Reader) > Do Not Consider Class for Avro Generic Record Reader > > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... > {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) > ~[avro-1.9.1.jar:1.9.1] > at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) > ~[na:1.8.0_232] > at java.lang.ClassValue.getFromBackup(ClassValue.java:209) > ~[na:1.8.0_232] > at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] > {code} > I was surprised because it should just load a {{GenericRecord}} view of the > data. But alas, I have the Avro Schema defined with the {{namespace}} and > {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so > happens to be a real class on the class path, so it is trying to call the > public constructor on the class and this constructor does does not exist. > Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema > namespace information. > I am putting {{GenericRecords}} into the Parquet file, I expect to get > {{GenericRecords}} back out when I read it. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1778) Do Not Record Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1778: Description: {code:java|title=Example Code} final ParquetReader reader = AvroParquetReader.builder(path).build(); final GenericRecord genericRecord = reader.read(); {code} It fails with... {code:none} java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1] at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232] at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232] at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] {code} I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class and this constructor does does not exist. Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file, I expect to get {{GenericRecords}} back out when I read it. was: {code:java|title=Example Code} final ParquetReader reader = AvroParquetReader.builder(path).build(); final GenericRecord genericRecord = reader.read(); {code} It fails with... 
{code:none} java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1] at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232] at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232] at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] {code} I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class which does not exist. There {{GenericRecordReader}} should always ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file, I expect to get {{GenericRecords}} back out when I read it. > Do Not Record Class for Avro Generic Record Reader > -- > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... 
> {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) > ~[avro-1.9.1.jar:1.9.1] > at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) > ~[na:1.8.0_232] > at java.lang.ClassValue.getFromBackup(ClassValue.java:209) > ~[na:1.8.0_232] > at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] > {code} > I was surprised because it should just load a {{GenericRecord}} view of the > data. But alas, I have the Avro Schema defined with the {{namespace}} and > {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so > happens to be a real class on the class path, so it is trying to call the > public constructor on the class and this constructor does does not exist. > Regardless, t
[jira] [Updated] (PARQUET-1778) Do Not Record Class for Avro Generic Record Reader
[ https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1778: Description: {code:java|title=Example Code} final ParquetReader reader = AvroParquetReader.builder(path).build(); final GenericRecord genericRecord = reader.read(); {code} It fails with... {code:none} java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.() at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1] at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232] at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232] at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] {code} I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class which does not exist. There {{GenericRecordReader}} should always ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file, I expect to get {{GenericRecords}} back out when I read it. was: {code:java} final ParquetReader reader = AvroParquetReader.builder(path).build();final ParquetReader reader = AvroParquetReader.builder(path).build(); final GenericRecord genericRecord = reader.read(); {code} It fails with... 
{code:none} java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.&lt;init&gt;() at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1] at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232] at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232] at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] {code} I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class which does not exist. The {{GenericRecordReader}} should always ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file, I expect to get {{GenericRecords}} back out when I read it. > Do Not Record Class for Avro Generic Record Reader > -- > > Key: PARQUET-1778 > URL: https://issues.apache.org/jira/browse/PARQUET-1778 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Major > > > {code:java|title=Example Code} > final ParquetReader reader = > AvroParquetReader.builder(path).build(); > final GenericRecord genericRecord = reader.read(); > {code} > It fails with... 
> {code:none} > java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.&lt;init&gt;() > at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] > at java.lang.Class.getDeclaredConstructor(Class.java:2178) > ~[na:1.8.0_232] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) > ~[avro-1.9.1.jar:1.9.1] > at > org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) > ~[avro-1.9.1.jar:1.9.1] > at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) > ~[na:1.8.0_232] > at java.lang.ClassValue.getFromBackup(ClassValue.java:209) > ~[na:1.8.0_232] > at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] > {code} > I was surprised because it should just load a {{GenericRecord}} view of the > data. But alas, I have the Avro Schema defined with the {{namespace}} and > {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so > happens to be a real class on the class path, so it is trying to call the > public constructor on the class which does not exist. > There
[jira] [Created] (PARQUET-1778) Do Not Record Class for Avro Generic Record Reader
David Mollitor created PARQUET-1778: --- Summary: Do Not Record Class for Avro Generic Record Reader Key: PARQUET-1778 URL: https://issues.apache.org/jira/browse/PARQUET-1778 Project: Parquet Issue Type: Improvement Reporter: David Mollitor {code:java} final ParquetReader reader = AvroParquetReader.builder(path).build(); final GenericRecord genericRecord = reader.read(); {code} It fails with... {code:none} java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.&lt;init&gt;() at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232] at java.lang.Class.getDeclaredConstructor(Class.java:2178) ~[na:1.8.0_232] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) ~[avro-1.9.1.jar:1.9.1] at org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) ~[avro-1.9.1.jar:1.9.1] at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) ~[na:1.8.0_232] at java.lang.ClassValue.getFromBackup(ClassValue.java:209) ~[na:1.8.0_232] at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232] {code} I was surprised because it should just load a {{GenericRecord}} view of the data. But alas, I have the Avro Schema defined with the {{namespace}} and {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so happens to be a real class on the class path, so it is trying to call the public constructor on the class which does not exist. The {{GenericRecordReader}} should always ignore this Avro Schema namespace information. I am putting {{GenericRecords}} into the Parquet file, I expect to get {{GenericRecords}} back out when I read it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
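The failure in the stack trace above is ordinary reflection behavior: Avro's {{SpecificData}} looks up a no-arg constructor on whatever class matches the schema's {{namespace}}/{{name}}. A stdlib-only sketch of that same lookup (the class and method names here are hypothetical illustrations, not Parquet or Avro API):

```java
// Mimics what Avro's SpecificData does when the schema's namespace/name
// happens to match a real class: reflectively look up a no-arg constructor.
public class NoArgLookupDemo {

    // Like io.github.belugabehr.app.Record in the report: no no-arg constructor.
    static class Record {
        Record(String value) {}
    }

    public static boolean hasNoArgConstructor(Class<?> clazz) {
        try {
            // Same lookup the stack trace shows: Class#getDeclaredConstructor()
            clazz.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasNoArgConstructor(Record.class)); // false
        System.out.println(hasNoArgConstructor(Object.class)); // true
    }
}
```

This is why the read only fails when the schema name collides with a real class on the classpath; with no matching class, Avro falls back to a {{GenericRecord}} view.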
Re: Parquet Verbose Logging
Hello Ryan, I appreciate you taking the time to share your thoughts. I'd just like to point out that there is also TRACE level logging if Parquet requires greater granularity. Furthermore, I'm not suggesting that there be an unbreakable rule that all logging must be DEBUG, but it should be the exception, not the rule. It is more likely the situation that the wrapping application would be responsible for logging at the INFO and WARN/ERROR level. Something like try { LOG.info("Using Parquet to read file {}", path); avroParquetReader.read(); } catch (Exception e) { LOG.error("Failed to read Parquet file", e); } This is a very normal setup and doesn't require any additional logging from the Parquet library itself. Once I see an error with "Failed to read Parquet file", then I'm going to turn on DEBUG logging and try to reproduce the error. Thanks, David On Fri, Jan 24, 2020 at 12:01 PM Ryan Blue wrote: > I don't agree with the idea to convert all of Parquet's logs to DEBUG > level, but I do think that we can improve the levels of individual > messages. > > If we convert all logs to debug, then turning on logs to see what Parquet > is doing would show everything from opening an input file to position > tracking in output files. That's way too much information, which is why we > use different log levels to begin with. > > I think we should continue using log levels to distinguish between types of > information: error for errors, warn for recoverable errors that may or may > not indicate a problem, info for regular operations, and debug for extra > information if you're debugging the Parquet library. Following the common > convention enables people to choose what information they want instead of > mixing it all together. > > If you want to only see error and warning logs from Parquet, then the right > way to do that is to configure your logger so that the level for > org.apache.parquet classes is warn. 
That's not to say I don't agree that we > can cut down on what is logged at info and clean it up; I just don't think > it's a good idea to abandon the idea of log levels to distinguish between > different information the user of a library will need. > > On Fri, Jan 24, 2020 at 6:30 AM lukas nalezenec wrote: > > > Hi, > > I can help too. > > Lukas > > > > Dne pá 24. 1. 2020 15:29 uživatel David Mollitor > > napsal: > > > > > Hello Team, > > > > > > I am happy to do the work of reviewing all Parquet logging, but I need > > help > > > getting the work committed. > > > > > > Fokko Driesprong has been a wonderfully ally in helping me get > > incremental > > > improvements into Parquet, but I wonder if there's anyone else that can > > > share in the load. > > > > > > Thanks, > > > David > > > > > > On Thu, Jan 23, 2020 at 11:55 AM Michael Heuer > > wrote: > > > > > > > Hello David, > > > > > > > > As I mentioned on PARQUET-1758, we have been frustrated by overly > > verbose > > > > logging in Parquet for a long time. Various workarounds have been > more > > > or > > > > less successful, e.g. > > > > > > > > https://github.com/bigdatagenomics/adam/issues/851 < > > > > https://github.com/bigdatagenomics/adam/issues/851> > > > > > > > > I would support a move making Parquet a silent partner. :) > > > > > > > >michael > > > > > > > > > > > > > On Jan 23, 2020, at 10:25 AM, David Mollitor > > > wrote: > > > > > > > > > > Hello Team, > > > > > > > > > > I have been a consumer of Apache Parquet through Apache Hive for > > > several > > > > > years now. For a long time, logging in Parquet has been pretty > > > painful. > > > > > Some of the logging was going to STDOUT and some was going to > Log4J. > > > > > Overall, though the framework has been too verbose, spewing many > log > > > > lines > > > > > about internal details of Parquet I don't understand. > > > > > > > > > > The logging has gotten a lot better with recent releases moving > > solidly > > > > > into SLF4J. 
That is awesome and very welcomed. However, (opinion > > > > alert) I > > > > > think the logging is still too verbose. I think Parquet should be > a > > > > silent > > > > > partner in data processing. If everything is going well, it should > > be > > >
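For reference, the per-package configuration Ryan describes is a one-line change in the consuming application's logger setup. A sketch for a Log4J 1.x properties file (assuming a log4j.properties-based deployment, as was typical for Hive-era stacks; other SLF4J backends have equivalent settings):

```
# Silence everything below WARN coming from the Parquet library
log4j.logger.org.apache.parquet=WARN
```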
Re: Writing to Local File
Thanks Ryan for the confirmation of my suspicions. That would certainly make a quick sample application easier to achieve from an adoption perspective. I had just put this JIRA in. I'll leave it open for anyone to jump in on. https://issues.apache.org/jira/browse/PARQUET-1776 Thanks, David On Fri, Jan 24, 2020 at 12:08 PM Ryan Blue wrote: > There's not currently a way to do this without Hadoop. We've been working > on moving to the `InputFile` and `OutputFile` abstractions so that we can > get rid of it, but Parquet still depends on Hadoop libraries for > compression and we haven't pulled out the parts of Parquet that use the new > abstraction from the older ones that accept Hadoop Paths, so you need to > have Hadoop in your classpath either way. > > To get to where you can write a file without Hadoop dependencies, I think > we need to create a new module that parquet-hadoop will depend on with the > `InputFile`/`OutputFile` implementation. Then we would refactor the Hadoop > classes to extend those implementations to avoid breaking the Hadoop > classes. We'd also need to implement the compression API directly on top of > aircompressor in this module. > > On Thu, Jan 23, 2020 at 4:40 PM David Mollitor wrote: > > > I am usually a user of Parquet through Hive or Spark, but I wanted to sit > > down and write my own small example application of using the library > > directly. > > > > Is there some quick way that I can write a Parquet file to the local file > > system using java.nio.Path (i.e., with no Hadoop dependencies?) > > > > Thanks! > > > > > -- > Ryan Blue > Software Engineer > Netflix >
[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1776: Description: Add a wrapper around Java NIO Path for {{org.apache.parquet.io.OutputFile}} and {{org.apache.parquet.io.InputFile}} (was: Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and {{org.apache.parquet.io.InputFile}}) > Add Java NIO Avro OutputFile InputFile > -- > > Key: PARQUET-1776 > URL: https://issues.apache.org/jira/browse/PARQUET-1776 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro > Reporter: David Mollitor >Priority: Minor > > Add a wrapper around Java NIO Path for {{org.apache.parquet.io.OutputFile}} > and {{org.apache.parquet.io.InputFile}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1776: Labels: (was: avro) > Add Java NIO Avro OutputFile InputFile > -- > > Key: PARQUET-1776 > URL: https://issues.apache.org/jira/browse/PARQUET-1776 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro > Reporter: David Mollitor >Priority: Minor > > Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and > {{org.apache.parquet.io.InputFile}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile
David Mollitor created PARQUET-1776: --- Summary: Add Java NIO Avro OutputFile InputFile Key: PARQUET-1776 URL: https://issues.apache.org/jira/browse/PARQUET-1776 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and {{org.apache.parquet.io.InputFile}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
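The heart of the NIO wrapper PARQUET-1776 asks for is a byte-position-tracking output stream, since Parquet needs the current write position to record column chunk offsets. A stdlib-only sketch of that piece (the class name is hypothetical; the real wrapper would adapt this to the {{org.apache.parquet.io.PositionOutputStream}} contract, which is an assumption here):

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Tracks the number of bytes written to the underlying stream, which is
// the capability Parquet's OutputFile abstraction needs from a sink.
public class CountingOutputStream extends FilterOutputStream {
    private long pos = 0;

    public CountingOutputStream(OutputStream out) {
        super(out);
    }

    // Analogous to PositionOutputStream#getPos()
    public long getPos() {
        return pos;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        pos++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        pos += len;
    }
}
```

A NIO-based {{OutputFile}} could then open such a stream over `java.nio.file.Files.newOutputStream(path)` with no Hadoop classes involved.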
[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1776: Labels: avro (was: ) > Add Java NIO Avro OutputFile InputFile > -- > > Key: PARQUET-1776 > URL: https://issues.apache.org/jira/browse/PARQUET-1776 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor >Priority: Minor > Labels: avro > > Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and > {{org.apache.parquet.io.InputFile}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile
[ https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1776: Component/s: parquet-avro > Add Java NIO Avro OutputFile InputFile > -- > > Key: PARQUET-1776 > URL: https://issues.apache.org/jira/browse/PARQUET-1776 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro > Reporter: David Mollitor >Priority: Minor > Labels: avro > > Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and > {{org.apache.parquet.io.InputFile}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1775) Deprecate AvroParquetWriter Builder Hadoop Path
David Mollitor created PARQUET-1775: --- Summary: Deprecate AvroParquetWriter Builder Hadoop Path Key: PARQUET-1775 URL: https://issues.apache.org/jira/browse/PARQUET-1775 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor Trying to write a sample program with Parquet and came across the following quirk: The {{AvroParquetWriter}} has no qualms about building one using {{org.apache.hadoop.fs.Path}}. However, doing so in {{AvroParquetReader}} is deprecated. I think it's appropriate to remove all dependencies on Hadoop from this simple reader/writer API. To make it consistent, also deprecate the use of {{org.apache.hadoop.fs.Path}} in the {{AvroParquetWriter}}. [https://github.com/apache/parquet-mr/blob/8c1bc9bcdeeac8178fecf61d18dc56913907fd46/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L38] https://github.com/apache/parquet-mr/blob/8c1bc9bcdeeac8178fecf61d18dc56913907fd46/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L47 -- This message was sent by Atlassian Jira (v8.3.4#803005)
Re: Parquet Verbose Logging
Hello Team, I am happy to do the work of reviewing all Parquet logging, but I need help getting the work committed. Fokko Driesprong has been a wonderful ally in helping me get incremental improvements into Parquet, but I wonder if there's anyone else that can share in the load. Thanks, David On Thu, Jan 23, 2020 at 11:55 AM Michael Heuer wrote: > Hello David, > > As I mentioned on PARQUET-1758, we have been frustrated by overly verbose > logging in Parquet for a long time. Various workarounds have been more or > less successful, e.g. > > https://github.com/bigdatagenomics/adam/issues/851 < > https://github.com/bigdatagenomics/adam/issues/851> > > I would support a move making Parquet a silent partner. :) > >michael > > > > On Jan 23, 2020, at 10:25 AM, David Mollitor wrote: > > > > Hello Team, > > > > I have been a consumer of Apache Parquet through Apache Hive for several > > years now. For a long time, logging in Parquet has been pretty painful. > > Some of the logging was going to STDOUT and some was going to Log4J. > > Overall, though the framework has been too verbose, spewing many log > lines > > about internal details of Parquet I don't understand. > > > > The logging has gotten a lot better with recent releases moving solidly > > into SLF4J. That is awesome and very welcomed. However, (opinion > alert) I > > think the logging is still too verbose. I think Parquet should be a > silent > > partner in data processing. If everything is going well, it should be > > silent (DEBUG level logging). If things are going wrong, it should throw > > an Exception. > > > > If an operator suspects Parquet is the issue (and that's rarely the first > > thing to check), they can set the logging for all of the Loggers in the > > entire Parquet package (org.apache.parquet) to DEBUG to get the required > > information. Not to mention, the less logging it does, the faster it > will > > be. 
> > > > I've opened this discussion because I've got two PRs related to this > topic > > ready to go: > > > > PARQUET-1758 > > PARQUET-1761 > > > > Thanks, > > David > >
Writing to Local File
I am usually a user of Parquet through Hive or Spark, but I wanted to sit down and write my own small example application of using the library directly. Is there some quick way that I can write a Parquet file to the local file system using java.nio.Path (i.e., with no Hadoop dependencies?) Thanks!
[jira] [Updated] (PARQUET-1758) InternalParquetRecordReader Logging is Too Verbose
[ https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1758: Summary: InternalParquetRecordReader Logging is Too Verbose (was: InternalParquetRecordReader Logging it Too Verbose) > InternalParquetRecordReader Logging is Too Verbose > -- > > Key: PARQUET-1758 > URL: https://issues.apache.org/jira/browse/PARQUET-1758 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > Labels: pull-request-available > > A low-level library like Parquet should be pretty quiet. It should just do > its work and keep quiet. Most issues should be addressed by throwing > Exceptions, and the occasional warning message otherwise it will clutter the > logging for the top-level application. If debugging is required, > administrator can enable it for the specific workload. > *Warning:* This is my opinion. No stats to back it up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
Parquet Verbose Logging
Hello Team, I have been a consumer of Apache Parquet through Apache Hive for several years now. For a long time, logging in Parquet has been pretty painful. Some of the logging was going to STDOUT and some was going to Log4J. Overall, though, the framework has been too verbose, spewing many log lines about internal details of Parquet I don't understand. The logging has gotten a lot better with recent releases moving solidly into SLF4J. That is awesome and very welcomed. However, (opinion alert) I think the logging is still too verbose. I think Parquet should be a silent partner in data processing. If everything is going well, it should be silent (DEBUG level logging). If things are going wrong, it should throw an Exception. If an operator suspects Parquet is the issue (and that's rarely the first thing to check), they can set the logging for all of the Loggers in the entire Parquet package (org.apache.parquet) to DEBUG to get the required information. Not to mention, the less logging it does, the faster it will be. I've opened this discussion because I've got two PRs related to this topic ready to go: PARQUET-1758 PARQUET-1761 Thanks, David
Re: Spotless
I think you want this in place before bloom filters are released. Since it's the newest code, it is most at risk of receiving fixes and improvements. You're not going to want to use spotless after the feature is introduced and make back ports more difficult. On Wed, Jan 22, 2020, 10:02 AM Driesprong, Fokko wrote: > I've rebased the PR: https://github.com/apache/parquet-mr/pull/730 > > I did some searching and as far as I can tell, spotless does not allow to > only apply it to VCS changed lines. If the forked repo also applies > spotless, then it would be possible to do a diff. > > For me, I'm still interested in applying this, so we can keep our code > clean and consistent. For example, I would like to enforce the use of > braces, as it makes the code much more readable in my opinion. > > Cheers, Fokko > > > > Op do 9 jan. 2020 om 10:23 schreef Gabor Szadovszky : > > > Personally, I don't like formatting the whole code during minor version > > development. These changes make really hard cherry-picking changes to > > forked repos. It also makes hard to blame the code. > > It is great to have a common code style and formatting configuration but > I > > would only apply them to the new lines. Let's do such changes that > impacts > > the whole code base at the beginning of a new major version development > > where compatibility will break anyway. > > > > I am hesitating to give a -1, though. If everyone agrees on this is a > good > > idea, I'm fine with that. So, let me give a -0. > > > > Cheers, > > Gabor > > > > On Wed, Jan 8, 2020 at 7:36 PM Ryan Blue > > wrote: > > > > > +1 for spotless checks. 
> > > > > > On Wed, Jan 8, 2020 at 7:13 AM Driesprong, Fokko > > > > wrote: > > > > > > > Y'all, > > > > > > > > Recently Chen Junjie brought up the removal of trailing spaces within > > the > > > > code and the headers: > > > > https://github.com/apache/parquet-mr/pull/727#issuecomment-571562392 > > > > > > > > I've been looking into this and looked into if we can apply something > > > like > > > > checkstyle to let it fail on trailing whitespace. However, it comes > up > > > with > > > > a LOT of warnings on improper formatting, short variables, wrong > import > > > > orders, etc. > > > > For Apache Avro we've added Spotless as a maven plugin: > > > > https://github.com/diffplug/spotless. Unlike checkstyle, spotless > will > > > > also > > > > fix the formatting. Would this be something that others find useful? > > > > The main problem is that we need to apply this to the codebase, and > > this > > > > will break a lot of PR's, and it will mess up a bit of the version > > > control, > > > > because a lot of lines will be changed: > > > > https://github.com/apache/parquet-mr/pull/730/ > > > > > > > > WDYT? > > > > > > > > Cheers, Fokko > > > > > > > > > > > > > -- > > > Ryan Blue > > > Software Engineer > > > Netflix > > > > > >
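For concreteness, a spotless-maven-plugin stanza of the kind under discussion might look like the following. This is a hedged sketch of the plugin's general shape (the step selection is an assumption; PR #730 may configure it differently, and the version element is omitted deliberately):

```xml
<plugin>
  <groupId>com.diffplug.spotless</groupId>
  <artifactId>spotless-maven-plugin</artifactId>
  <configuration>
    <java>
      <!-- Each step both checks and, via `mvn spotless:apply`, fixes formatting -->
      <removeUnusedImports/>
      <trimTrailingWhitespace/>
      <endWithNewline/>
    </java>
  </configuration>
</plugin>
```

Running `mvn spotless:check` would then fail the build on violations such as the trailing whitespace Chen Junjie raised.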
[jira] [Commented] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose
[ https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014416#comment-17014416 ] David Mollitor commented on PARQUET-1758: - I think the general idea is that almost all logging is DEBUG level for such a library. It may be advantageous to setup YETUS so that the automated builds are with DEBUG log enabled, but my feeling is that most logging shouldn't be enabled by default. > InternalParquetRecordReader Logging it Too Verbose > -- > > Key: PARQUET-1758 > URL: https://issues.apache.org/jira/browse/PARQUET-1758 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > Labels: pull-request-available > > A low-level library like Parquet should be pretty quiet. It should just do > its work and keep quiet. Most issues should be addressed by throwing > Exceptions, and the occasional warning message otherwise it will clutter the > logging for the top-level application. If debugging is required, > administrator can enable it for the specific workload. > *Warning:* This is my opinion. No stats to back it up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose
[ https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014010#comment-17014010 ] David Mollitor edited comment on PARQUET-1758 at 1/13/20 4:21 AM: -- I am certainly open for discussions. I too have had some logging pain emanating from Parquet with the Apache Hive project. Debug logging would only help performance since less time would be spent logging. was (Author: belugabehr): Debug logging would only help performance since less time would be spent logging. > InternalParquetRecordReader Logging it Too Verbose > -- > > Key: PARQUET-1758 > URL: https://issues.apache.org/jira/browse/PARQUET-1758 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > Labels: pull-request-available > > A low-level library like Parquet should be pretty quiet. It should just do > its work and keep quiet. Most issues should be addressed by throwing > Exceptions, and the occasional warning message otherwise it will clutter the > logging for the top-level application. If debugging is required, > administrator can enable it for the specific workload. > *Warning:* This is my opinion. No stats to back it up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose
[ https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014010#comment-17014010 ] David Mollitor commented on PARQUET-1758: - Debug logging would only help performance since less time would be spent logging. > InternalParquetRecordReader Logging it Too Verbose > -- > > Key: PARQUET-1758 > URL: https://issues.apache.org/jira/browse/PARQUET-1758 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > Labels: pull-request-available > > A low-level library like Parquet should be pretty quiet. It should just do > its work and keep quiet. Most issues should be addressed by throwing > Exceptions, and the occasional warning message otherwise it will clutter the > logging for the top-level application. If debugging is required, > administrator can enable it for the specific workload. > *Warning:* This is my opinion. No stats to back it up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1763) Add SLF4J to TestCircularReferences
David Mollitor created PARQUET-1763: --- Summary: Add SLF4J to TestCircularReferences Key: PARQUET-1763 URL: https://issues.apache.org/jira/browse/PARQUET-1763 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor Currently prints to STDOUT. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1762) Move BitPackingPerfTest to parquet-benchmarks Module
David Mollitor created PARQUET-1762: --- Summary: Move BitPackingPerfTest to parquet-benchmarks Module Key: PARQUET-1762 URL: https://issues.apache.org/jira/browse/PARQUET-1762 Project: Parquet Issue Type: Improvement Reporter: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1761) Lower Logging Level in ParquetOutputFormat
David Mollitor created PARQUET-1761: --- Summary: Lower Logging Level in ParquetOutputFormat Key: PARQUET-1761 URL: https://issues.apache.org/jira/browse/PARQUET-1761 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1760) Use SLF4J Logger for TestStatistics
David Mollitor created PARQUET-1760: --- Summary: Use SLF4J Logger for TestStatistics Key: PARQUET-1760 URL: https://issues.apache.org/jira/browse/PARQUET-1760 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor It is dumping a lot of logging into STDOUT and STDERR. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1759) InternalParquetRecordReader Use Singleton Set
David Mollitor created PARQUET-1759: --- Summary: InternalParquetRecordReader Use Singleton Set Key: PARQUET-1759 URL: https://issues.apache.org/jira/browse/PARQUET-1759 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor https://github.com/apache/parquet-mr/blob/d85a8f5dcfc1381655fcccaa81a2e83ba812f6a4/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L260-L262 Code currently instantiates a {{HashSet}} (with a default internal capacity of 16) and then makes it immutable. Use Collections#singleton to achieve this same goal with fewer lines of code and lower memory requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
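The change above can be illustrated with plain JDK code. A minimal sketch (class and method names are hypothetical, not the actual InternalParquetRecordReader code):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SingletonSetDemo {

    // Before: allocates a 16-bucket HashSet just to hold one element,
    // then wraps it in an unmodifiable view.
    public static Set<String> before(String path) {
        Set<String> paths = new HashSet<>();
        paths.add(path);
        return Collections.unmodifiableSet(paths);
    }

    // After: a dedicated immutable single-element set, no backing hash table.
    public static Set<String> after(String path) {
        return Collections.singleton(path);
    }

    public static void main(String[] args) {
        // Both are equal as sets; only the allocation differs.
        System.out.println(before("file.parquet").equals(after("file.parquet"))); // true
    }
}
```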
[jira] [Created] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose
David Mollitor created PARQUET-1758: --- Summary: InternalParquetRecordReader Logging it Too Verbose Key: PARQUET-1758 URL: https://issues.apache.org/jira/browse/PARQUET-1758 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor A low-level library like Parquet should be pretty quiet. It should just do its work and keep quiet. Most issues should be addressed by throwing Exceptions, with the occasional warning message; otherwise it will clutter the logging for the top-level application. If debugging is required, administrators can enable it for the specific workload. *Warning:* This is my opinion. No stats to back it up. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1756) Remove Dependency on Maven Plugin semantic-versioning
[ https://issues.apache.org/jira/browse/PARQUET-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1756: Summary: Remove Dependency on Maven Plugin semantic-versioning (was: Remove References to Maven Plugin semantic-versioning) > Remove Dependency on Maven Plugin semantic-versioning > - > > Key: PARQUET-1756 > URL: https://issues.apache.org/jira/browse/PARQUET-1756 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > https://github.com/jeluard/semantic-versioning > According to their github page: > {quote} > This library is in dormant state and won't add any new feature. > {quote} > Also, looking at their README file, it looks like the Parquet library is > including their library in the Maven build process, but is not actually > calling it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1757) Upgrade Apache POM Parent Version to 22
David Mollitor created PARQUET-1757: --- Summary: Upgrade Apache POM Parent Version to 22 Key: PARQUET-1757 URL: https://issues.apache.org/jira/browse/PARQUET-1757 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1756) Remove References to Maven Plugin semantic-versioning
David Mollitor created PARQUET-1756: --- Summary: Remove References to Maven Plugin semantic-versioning Key: PARQUET-1756 URL: https://issues.apache.org/jira/browse/PARQUET-1756 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor https://github.com/jeluard/semantic-versioning According to their github page: {quote} This library is in dormant state and won't add any new feature. {quote} Also, looking at their README file, it looks like the Parquet library is including their library in the Maven build process, but is not actually calling it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1755) Remove slf4j-simple From parquet-benchmarks Module
[ https://issues.apache.org/jira/browse/PARQUET-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1755: Summary: Remove slf4j-simple From parquet-benchmarks Module (was: Module parquet-benchmarks Ships With slf4j-simple) > Remove slf4j-simple From parquet-benchmarks Module > -- > > Key: PARQUET-1755 > URL: https://issues.apache.org/jira/browse/PARQUET-1755 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > The {{parquet-benchmarks}} module ships with the Log4J logger and the SLF4J > "simple" logger. Since this is a stand-alone application and needs Log4J, > there is no reason to also use the "simple" logger. > {code:none} > ### parquet-benchmarks > [1;34mINFO] Including org.slf4j:slf4j-simple:jar:1.7.22 in the shaded jar. > [1;34mINFO] Including org.slf4j:slf4j-api:jar:1.7.22 in the shaded jar. > [1;34mINFO] Including log4j:log4j:jar:1.2.17 in the shaded jar. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1755) Module parquet-benchmarks Ships With slf4j-simple
David Mollitor created PARQUET-1755: --- Summary: Module parquet-benchmarks Ships With slf4j-simple Key: PARQUET-1755 URL: https://issues.apache.org/jira/browse/PARQUET-1755 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor The {{parquet-benchmarks}} module ships with the Log4J logger and the SLF4J "simple" logger. Since this is a stand-alone application and needs Log4J, there is no reason to also use the "simple" logger. {code:none} ### parquet-benchmarks [INFO] Including org.slf4j:slf4j-simple:jar:1.7.22 in the shaded jar. [INFO] Including org.slf4j:slf4j-api:jar:1.7.22 in the shaded jar. [INFO] Including log4j:log4j:jar:1.2.17 in the shaded jar. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1754) Include SLF4J Logger For parquet-format-structures Tests
David Mollitor created PARQUET-1754: --- Summary: Include SLF4J Logger For parquet-format-structures Tests Key: PARQUET-1754 URL: https://issues.apache.org/jira/browse/PARQUET-1754 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor {code:none} ### /home/apache/parquet/parquet-mr/parquet-format-structures --- T E S T S --- Running org.apache.parquet.format.TestUtil SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1753) Ensure Parquet Version slf4j Libraries Are Included In parquet-thrift Module
[ https://issues.apache.org/jira/browse/PARQUET-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1753: Summary: Ensure Parquet Version slf4j Libraries Are Included In parquet-thrift Module (was: Ensure Parquet Version slf4j Libraries Are Included) > Ensure Parquet Version slf4j Libraries Are Included In parquet-thrift Module > > > Key: PARQUET-1753 > URL: https://issues.apache.org/jira/browse/PARQUET-1753 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > {code:none} > ### parquet-thrift > [INFO] Excluding com.google.code.findbugs:jsr305:jar:3.0.0 from the > shaded jar. > [INFO] Excluding com.twitter.elephantbird:elephant-bird-core:jar:4.4 > from the shaded jar. > [INFO] Excluding > com.twitter.elephantbird:elephant-bird-hadoop-compat:jar:4.4 from the shaded > jar. > ***[INFO] Excluding org.slf4j:slf4j-api:jar:1.6.4 from the shaded jar.*** > [INFO] Excluding commons-lang:commons-lang:jar:2.4 from the shaded jar. > [INFO] Excluding com.google.guava:guava:jar:11.0.1 from the shaded jar. > {code} > You can see that slf4j-api is version *1.6.4*. All other parquet modules are > using *1.7.x*. > 1.6.4 is being brought in by some old dependencies (primarily > {{com.twitter.elephantbird}}). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1753) Ensure Parquet Version slf4j Libraries Are Included
David Mollitor created PARQUET-1753: --- Summary: Ensure Parquet Version slf4j Libraries Are Included Key: PARQUET-1753 URL: https://issues.apache.org/jira/browse/PARQUET-1753 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor {code:none} ### parquet-thrift [INFO] Excluding com.google.code.findbugs:jsr305:jar:3.0.0 from the shaded jar. [INFO] Excluding com.twitter.elephantbird:elephant-bird-core:jar:4.4 from the shaded jar. [INFO] Excluding com.twitter.elephantbird:elephant-bird-hadoop-compat:jar:4.4 from the shaded jar. ***[INFO] Excluding org.slf4j:slf4j-api:jar:1.6.4 from the shaded jar.*** [INFO] Excluding commons-lang:commons-lang:jar:2.4 from the shaded jar. [INFO] Excluding com.google.guava:guava:jar:11.0.1 from the shaded jar. {code} You can see that slf4j-api is version *1.6.4*. All other parquet modules are using *1.7.x*. 1.6.4 is being brought in by some old dependencies (primarily {{com.twitter.elephantbird}}). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1752) Remove slf4j-log4j12 Binding from parquet-protobuf Module
David Mollitor created PARQUET-1752: --- Summary: Remove slf4j-log4j12 Binding from parquet-protobuf Module Key: PARQUET-1752 URL: https://issues.apache.org/jira/browse/PARQUET-1752 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor {code:none} Running org.apache.parquet.proto.ProtoInputOutputFormatTest SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] {code} There are two bindings being included and it produces this warning. There is also a log4j properties file in the test resources, but all it does is produce logging to the console. Just stick with the {{slf4j-simple}} logger for testing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1752) Remove slf4j-log4j12 Binding from parquet-protobuf Module
[ https://issues.apache.org/jira/browse/PARQUET-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1752: Description: {code:none} Running org.apache.parquet.proto.ProtoInputOutputFormatTest SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] {code} There are two bindings being included and it produces this warning. {{slf4j-log4j12}} is coming in as a transitive dependency. There is also a log4j properties file in the test resources, but all it does is produce logging to the console. Just stick with the {{slf4j-simple}} logger for testing (which is already explicitly specified for testing) was: {code:none} Running org.apache.parquet.proto.ProtoInputOutputFormatTest SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] {code} There are two bindings being included and it produces this warning. There is also a log4j properties file in the test resources, but all it does is produce logging to the console. Just stick with the {{slf4j-simple}} logger for testing. 
> Remove slf4j-log4j12 Binding from parquet-protobuf Module > - > > Key: PARQUET-1752 > URL: https://issues.apache.org/jira/browse/PARQUET-1752 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > {code:none} > Running org.apache.parquet.proto.ProtoInputOutputFormatTest > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > {code} > There are two bindings being included and it produces this warning. > {{slf4j-log4j12}} is coming in as a transitive dependency. There is also a > log4j properties file in the test resources, but all it does is produce > logging to the console. Just stick with the {{slf4j-simple}} logger for > testing (which is already explicitly specified for testing) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1751) Fix Protobuf Build Warning
David Mollitor created PARQUET-1751: --- Summary: Fix Protobuf Build Warning Key: PARQUET-1751 URL: https://issues.apache.org/jira/browse/PARQUET-1751 Project: Parquet Issue Type: Improvement Reporter: David Mollitor {code:none} [libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax specified for the proto file: TestProtobuf.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (PARQUET-1751) Fix Protobuf Build Warning
[ https://issues.apache.org/jira/browse/PARQUET-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor reassigned PARQUET-1751: --- Assignee: David Mollitor > Fix Protobuf Build Warning > -- > > Key: PARQUET-1751 > URL: https://issues.apache.org/jira/browse/PARQUET-1751 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Trivial > > {code:none} > [libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax > specified for the proto file: TestProtobuf.proto. Please use 'syntax = > "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to > proto2 syntax.) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1750) Reduce Memory Usage of RowRanges Class
David Mollitor created PARQUET-1750: --- Summary: Reduce Memory Usage of RowRanges Class Key: PARQUET-1750 URL: https://issues.apache.org/jira/browse/PARQUET-1750 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor {{RowRanges}} maintains an internal {{ArrayList}} with a default capacity (10). However, sometimes it is known ahead of time that only a single instance of {{Range}} will be added. For these cases, do not instantiate an {{ArrayList}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
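A minimal sketch of the idea behind this ticket. The `Range` class below is a stand-in, not the actual Parquet type: when a list is known to hold exactly one element, `Collections.singletonList` avoids allocating an `ArrayList` with its default backing capacity of 10.

```java
import java.util.Collections;
import java.util.List;

public class SingletonRangeExample {
    // Stand-in for the real org.apache.parquet Range type.
    static final class Range {
        final long from;
        final long to;
        Range(long from, long to) { this.from = from; this.to = to; }
    }

    static List<Range> singleRange(long from, long to) {
        // Allocates a tiny immutable one-element wrapper instead of an
        // ArrayList with a default-capacity Object[] backing array.
        return Collections.singletonList(new Range(from, to));
    }

    public static void main(String[] args) {
        List<Range> ranges = singleRange(0, 99);
        System.out.println(ranges.size()); // 1
    }
}
```

The returned list is immutable, so this only applies where the caller never adds further ranges, which is exactly the single-instance case the ticket describes.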
[jira] [Created] (PARQUET-1749) Use Java 8 Streams for Empty PrimitiveIterator
David Mollitor created PARQUET-1749: --- Summary: Use Java 8 Streams for Empty PrimitiveIterator Key: PARQUET-1749 URL: https://issues.apache.org/jira/browse/PARQUET-1749 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1737) Replace Test Class RandomStr with Apache Commons Lang
David Mollitor created PARQUET-1737: --- Summary: Replace Test Class RandomStr with Apache Commons Lang Key: PARQUET-1737 URL: https://issues.apache.org/jira/browse/PARQUET-1737 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1736) Use StringBuilder instead of StringBuffer
David Mollitor created PARQUET-1736: --- Summary: Use StringBuilder instead of StringBuffer Key: PARQUET-1736 URL: https://issues.apache.org/jira/browse/PARQUET-1736 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor StringBuffer is synchronized and therefore incurs the overhead even when it's not being used in a multi-threaded way. Use the unsynchronized StringBuilder instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
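An illustrative sketch of the swap the ticket proposes (not the actual Parquet code): both classes share the same API and produce the same result, but every `StringBuffer.append` acquires the object's monitor, while `StringBuilder` skips that cost.

```java
public class BuilderExample {
    static String joinBuffer(String[] parts) {
        StringBuffer sb = new StringBuffer(); // every append() is synchronized
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    static String joinBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder(); // unsynchronized, same API
        for (String p : parts) sb.append(p);
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] parts = {"a", "b", "c"};
        System.out.println(joinBuilder(parts).equals(joinBuffer(parts))); // true
    }
}
```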
[jira] [Created] (PARQUET-1735) Clean Up parquet-columns Module
David Mollitor created PARQUET-1735: --- Summary: Clean Up parquet-columns Module Key: PARQUET-1735 URL: https://issues.apache.org/jira/browse/PARQUET-1735 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor {code:none} Remove unused imports Remove unused local variables Add missing '@Override' annotations Add missing '@Override' annotations to implementations of interface methods Add missing '@Deprecated' annotations Remove unnecessary casts Remove redundant semicolons Remove unnecessary '$NON-NLS$' tags Remove redundant type arguments {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1732) Call toArray With Empty Array
[ https://issues.apache.org/jira/browse/PARQUET-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1732: Description: [https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java] {quote}It is recommended now to use list.toArray(new Foo[0]);, not list.toArray(new Foo[list.size()]);. {quote} ... less code too :) was: [https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java] {quote} It is recommended now to use list.toArray(new Foo[0]);, not list.toArray(new Foo[list.size()]);. {quote} > Call toArray With Empty Array > - > > Key: PARQUET-1732 > URL: https://issues.apache.org/jira/browse/PARQUET-1732 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > [https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java] > > {quote}It is recommended now to use list.toArray(new Foo[0]);, not > list.toArray(new Foo[list.size()]);. > {quote} > ... less code too :) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1732) Call toArray With Empty Array
David Mollitor created PARQUET-1732: --- Summary: Call toArray With Empty Array Key: PARQUET-1732 URL: https://issues.apache.org/jira/browse/PARQUET-1732 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor [https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java] {quote} It is recommended now to use list.toArray(new Foo[0]);, not list.toArray(new Foo[list.size()]);. {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
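A small sketch of the two idioms the ticket contrasts (illustrative names, not the Parquet code itself): the zero-length-array form lets the runtime allocate the correctly typed array in one step, and it is shorter.

```java
import java.util.Arrays;
import java.util.List;

public class ToArrayExample {
    // Preferred form: a zero-length array serves only as a type witness.
    static String[] toArrayPreferred(List<String> list) {
        return list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("x", "y", "z");
        String[] a = toArrayPreferred(list);
        // Older idiom the ticket replaces: a presized array.
        String[] b = list.toArray(new String[list.size()]);
        System.out.println(Arrays.equals(a, b)); // true
    }
}
```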
[jira] [Created] (PARQUET-1731) Use JDK 8 Facilities to Simplify FilteringRecordMaterializer
David Mollitor created PARQUET-1731: --- Summary: Use JDK 8 Facilities to Simplify FilteringRecordMaterializer Key: PARQUET-1731 URL: https://issues.apache.org/jira/browse/PARQUET-1731 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1729) Avoid AutoBoxing in EncodingStats
David Mollitor created PARQUET-1729: --- Summary: Avoid AutoBoxing in EncodingStats Key: PARQUET-1729 URL: https://issues.apache.org/jira/browse/PARQUET-1729 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor Use AtomicInteger instead of a Java immutable Integer type which must be un-boxed, and re-boxed each time. [https://www.programcreek.com/2013/10/efficient-counter-in-java/] -- This message was sent by Atlassian Jira (v8.3.4#803005)
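A hypothetical counter sketch of the technique (the map and key names are illustrative, not EncodingStats itself): with an `Integer` map value, each increment unboxes, adds, and boxes a fresh `Integer`; a mutable `AtomicInteger` value is updated in place instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CounterExample {
    static Map<String, AtomicInteger> counts = new HashMap<>();

    static void increment(String key) {
        // computeIfAbsent creates the counter once; later calls mutate
        // the existing AtomicInteger without any boxing.
        counts.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
    }

    public static void main(String[] args) {
        increment("PLAIN");
        increment("PLAIN");
        increment("RLE");
        System.out.println(counts.get("PLAIN").get()); // 2
    }
}
```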
[jira] [Updated] (PARQUET-1728) Simplify NullPointerException Handling in AvroWriteSupport
[ https://issues.apache.org/jira/browse/PARQUET-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1728: Summary: Simplify NullPointerException Handling in AvroWriteSupport (was: Simplify Handle NullPointerException Handling in AvroWriteSupport) > Simplify NullPointerException Handling in AvroWriteSupport > -- > > Key: PARQUET-1728 > URL: https://issues.apache.org/jira/browse/PARQUET-1728 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > > * Use Java Collection API to simplify > * Remove new-line character from logging to play nice with 'grep' -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1728) Simplify Handle NullPointerException Handling in AvroWriteSupport
David Mollitor created PARQUET-1728: --- Summary: Simplify Handle NullPointerException Handling in AvroWriteSupport Key: PARQUET-1728 URL: https://issues.apache.org/jira/browse/PARQUET-1728 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor * Use Java Collection API to simplify * Remove new-line character from logging to play nice with 'grep' -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1727) Do Not Swallow InterruptedException in ParquetLoader
David Mollitor created PARQUET-1727: --- Summary: Do Not Swallow InterruptedException in ParquetLoader Key: PARQUET-1727 URL: https://issues.apache.org/jira/browse/PARQUET-1727 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (PARQUET-1726) Use Java 8 Multi Exception Handling
[ https://issues.apache.org/jira/browse/PARQUET-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Mollitor updated PARQUET-1726: Description: Simplifies the code and removes lines of code > Use Java 8 Multi Exception Handling > --- > > Key: PARQUET-1726 > URL: https://issues.apache.org/jira/browse/PARQUET-1726 > Project: Parquet > Issue Type: Improvement > Reporter: David Mollitor > Assignee: David Mollitor >Priority: Minor > Labels: pull-request-available > > Simplifies the code and removes lines of code -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1726) Use Java 8 Multi Exception Handling
David Mollitor created PARQUET-1726: --- Summary: Use Java 8 Multi Exception Handling Key: PARQUET-1726 URL: https://issues.apache.org/jira/browse/PARQUET-1726 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
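A minimal sketch of the multi-catch form the ticket refers to (introduced in Java 7, so available on a Java 8 baseline; the method here is illustrative): one handler replaces two identical catch blocks.

```java
import java.io.IOException;

public class MultiCatchExample {
    static String classify(boolean io) {
        try {
            if (io) throw new IOException("io");
            throw new IllegalStateException("state");
        } catch (IOException | IllegalStateException e) {
            // Single handler shared by both exception types.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(true));  // io
        System.out.println(classify(false)); // state
    }
}
```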
[jira] [Created] (PARQUET-1725) Replace Usage of Strings.join with JDK Functionality in ColumnPath Class
David Mollitor created PARQUET-1725: --- Summary: Replace Usage of Strings.join with JDK Functionality in ColumnPath Class Key: PARQUET-1725 URL: https://issues.apache.org/jira/browse/PARQUET-1725 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1724) Use ConcurrentHashMap for Cache in DictionaryPageReader
David Mollitor created PARQUET-1724: --- Summary: Use ConcurrentHashMap for Cache in DictionaryPageReader Key: PARQUET-1724 URL: https://issues.apache.org/jira/browse/PARQUET-1724 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor * Use ConcurrentHashMap for Cache in DictionaryPageReader * Use Java 1.8 APIs -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1723) Read From Maps Without Using Contains
David Mollitor created PARQUET-1723: --- Summary: Read From Maps Without Using Contains Key: PARQUET-1723 URL: https://issues.apache.org/jira/browse/PARQUET-1723 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor I see a few places with the following pattern... {code:java} if (map.contains(key)) { return map.get(key); } {code} Better to just call {{get()}} and then check the return value for 'null' to determine if the key is there. This prevents the need to traverse the {{Map}} twice,... once for {{contains}} and once for {{get}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
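A sketch of the single-lookup pattern the ticket describes (illustrative method, not the Parquet code). Note the null-check form is only equivalent to `contains` + `get` when the map cannot hold null values.

```java
import java.util.HashMap;
import java.util.Map;

public class MapLookupExample {
    static String lookup(Map<String, String> map, String key, String fallback) {
        String value = map.get(key);        // one traversal instead of two
        return value != null ? value : fallback;
        // Java 8 also offers the equivalent map.getOrDefault(key, fallback).
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("a", "1");
        System.out.println(lookup(m, "a", "none")); // 1
        System.out.println(lookup(m, "b", "none")); // none
    }
}
```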
[jira] [Created] (PARQUET-1710) Use Objects.requireNonNull
David Mollitor created PARQUET-1710: --- Summary: Use Objects.requireNonNull Key: PARQUET-1710 URL: https://issues.apache.org/jira/browse/PARQUET-1710 Project: Parquet Issue Type: Improvement Reporter: David Mollitor Assignee: David Mollitor https://docs.oracle.com/javase/8/docs/api/java/util/Objects.html#requireNonNull-T-java.lang.String- -- This message was sent by Atlassian Jira (v8.3.4#803005)
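A brief sketch of the linked API (the class here is hypothetical): `Objects.requireNonNull` replaces a hand-rolled `if (x == null) throw ...` check in constructors and setters, failing fast with a descriptive message.

```java
import java.util.Objects;

public class RequireNonNullExample {
    private final String name;

    RequireNonNullExample(String name) {
        // Throws NullPointerException with the given message if name is null.
        this.name = Objects.requireNonNull(name, "name cannot be null");
    }

    public static void main(String[] args) {
        System.out.println(new RequireNonNullExample("parquet").name); // parquet
        try {
            new RequireNonNullExample(null);
        } catch (NullPointerException e) {
            System.out.println(e.getMessage()); // name cannot be null
        }
    }
}
```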
Re: Parquet vs. other Open Source Columnar Formats
I'm sure there are many different opinions on the matter, but in regards to Avro, I would say it is becoming more and more of a niche player. Many folks are choosing to go with Google Protobufs for RPC and Parquet/ORC for analytic workloads. On Thu, May 9, 2019 at 2:30 PM Brian Bowman wrote: > All, > > Is it fair to say that Parquet is fast becoming the dominant open source > columnar storage format? How do those of you with long-term Hadoop > experience see this? For example, is Parquet overtaking ORC and Avro? > > Thanks, > > Brian >