[jira] [Created] (PARQUET-2074) Upgrade to JDK 9+

2021-08-05 Thread David Mollitor (Jira)
David Mollitor created PARQUET-2074:
---

 Summary: Upgrade to JDK 9+
 Key: PARQUET-2074
 URL: https://issues.apache.org/jira/browse/PARQUET-2074
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor


Moving to JDK 9 will provide a plethora of new compare/equals methods on 
arrays that are all vectorized and annotated with 
{{@IntrinsicCandidate}}:

https://docs.oracle.com/javase/9/docs/api/java/util/Arrays.html
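A short stdlib-only sketch of the array helpers in question (these are real `java.util.Arrays` methods added in JDK 9; the performance characteristics are the issue's claim, not demonstrated here):

```java
import java.util.Arrays;

public class ArraysDemo {
    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4};
        byte[] b = {1, 2, 9, 4};

        // Lexicographic comparison without hand-rolled loops.
        int cmp = Arrays.compare(a, b);

        // Index of the first differing element, or -1 if the arrays match.
        int firstDiff = Arrays.mismatch(a, b);

        // Range-based equality without copying subarrays first.
        boolean prefixEqual = Arrays.equals(a, 0, 2, b, 0, 2);

        assert cmp < 0;          // a sorts before b (they differ at index 2)
        assert firstDiff == 2;
        assert prefixEqual;      // first two elements are equal
        System.out.println("ok");
    }
}
```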



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2072) Do Not Determine Both Min/Max for Binary Stats

2021-08-04 Thread David Mollitor (Jira)
David Mollitor created PARQUET-2072:
---

 Summary: Do Not Determine Both Min/Max for Binary Stats
 Key: PARQUET-2072
 URL: https://issues.apache.org/jira/browse/PARQUET-2072
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


I'm looking at some benchmarking code for Apache ORC vs. Apache Parquet and see 
that Parquet is quite a bit slower for writes (reads TBD).  Based on my 
investigation, I have noticed a significant amount of time spent determining 
min/max values for binary types.

One quick improvement is to bypass the "max" comparison when the value has 
already been determined to be a new "min".

While I'm at it, remove calls to deprecated functions.
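A hypothetical sketch of the optimization (not the parquet-mr statistics code, and using signed `Arrays.compare` where Parquet's real binary stats use their own comparators): once a non-first value wins the "min" comparison it cannot also exceed the current max, so an else-if saves one binary comparison per value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryStatsSketch {
    byte[] min;
    byte[] max;

    void update(byte[] value) {
        if (min == null) {
            // First value seeds both bounds.
            min = value;
            max = value;
        } else if (Arrays.compare(value, min) < 0) {
            // New min: it cannot also be above max, so the max check is bypassed.
            min = value;
        } else if (Arrays.compare(value, max) > 0) {
            max = value;
        }
    }

    public static void main(String[] args) {
        BinaryStatsSketch s = new BinaryStatsSketch();
        for (String v : new String[]{"pear", "apple", "plum", "banana"}) {
            s.update(v.getBytes(StandardCharsets.UTF_8));
        }
        assert new String(s.min, StandardCharsets.UTF_8).equals("apple");
        assert new String(s.max, StandardCharsets.UTF_8).equals("plum");
        System.out.println("ok");
    }
}
```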





[jira] [Created] (PARQUET-2063) Remove Compile Warnings from MemoryManager

2021-07-02 Thread David Mollitor (Jira)
David Mollitor created PARQUET-2063:
---

 Summary: Remove Compile Warnings from MemoryManager
 Key: PARQUET-2063
 URL: https://issues.apache.org/jira/browse/PARQUET-2063
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Updated] (PARQUET-2048) Deprecate BaseRecordReader

2021-05-14 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-2048:

Summary: Deprecate BaseRecordReader  (was: Remove BaseRecordReader)

> Deprecate BaseRecordReader
> --
>
> Key: PARQUET-2048
> URL: https://issues.apache.org/jira/browse/PARQUET-2048
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> No longer used.





[jira] [Created] (PARQUET-2048) Remove BaseRecordReader

2021-05-13 Thread David Mollitor (Jira)
David Mollitor created PARQUET-2048:
---

 Summary: Remove BaseRecordReader
 Key: PARQUET-2048
 URL: https://issues.apache.org/jira/browse/PARQUET-2048
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


No longer used.





[jira] [Updated] (PARQUET-2047) Clean Up Code

2021-05-13 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-2047:

Description: 
* Removed unused code
 * Remove unused imports
 * Add @Override annotations

Mostly throwing away superfluous stuff. Less is more.

  was:
* Removed unused code
 * Remove unused imports
 * Add \@Override annotations


> Clean Up Code
> -
>
> Key: PARQUET-2047
> URL: https://issues.apache.org/jira/browse/PARQUET-2047
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> * Removed unused code
>  * Remove unused imports
>  * Add @Override annotations
> Mostly throwing away superfluous stuff. Less is more.





[jira] [Created] (PARQUET-2047) Clean Up Code

2021-05-13 Thread David Mollitor (Jira)
David Mollitor created PARQUET-2047:
---

 Summary: Clean Up Code
 Key: PARQUET-2047
 URL: https://issues.apache.org/jira/browse/PARQUET-2047
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


* Removed unused code
 * Remove unused imports
 * Add \@Override annotations





[jira] [Created] (PARQUET-2046) Upgrade Apache POM to 23

2021-05-13 Thread David Mollitor (Jira)
David Mollitor created PARQUET-2046:
---

 Summary: Upgrade Apache POM to 23
 Key: PARQUET-2046
 URL: https://issues.apache.org/jira/browse/PARQUET-2046
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2021-01-06 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259913#comment-17259913
 ] 

David Mollitor commented on PARQUET-1666:
-

Shouldn't this be a Parquet-MR 2.0 action?

> Remove Unused Modules 
> --
>
> Key: PARQUET-1666
> URL: https://issues.apache.org/jira/browse/PARQUET-1666
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> In the last two meetings, Ryan Blue proposed to remove some unused Parquet 
> modules. This is to open a task to track it. 
> Here are the related meeting notes for the discussion on this. 
> Remove old Parquet modules
> Hive modules - sounds good
> Scrooge - Julien will reach out to twitter
> Tools - undecided - Cloudera may still use the parquet-tools according to 
> Gabor.
> Cascading - undecided
> We can mark the modules as deprecated in the description.





[jira] [Commented] (PARQUET-1126) make it easy to read and write parquet files in java without depending on hadoop

2020-12-15 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17249875#comment-17249875
 ] 

David Mollitor commented on PARQUET-1126:
-

Also check out some related work (waiting in a GitHub PR): [PARQUET-1776]

> make it easy to read and write parquet files in java without depending on 
> hadoop
> 
>
> Key: PARQUET-1126
> URL: https://issues.apache.org/jira/browse/PARQUET-1126
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Oscar Boykin
>Priority: Major
>
> I am happy to help with this but I'd love some guidance on:
> 1) likelihood of being accepted as a patch.
> 2) how critical it is to maintain backwards compatibility in APIs.
> For instance, we probably want to introduce a new artifact that lives under 
> the existing hadoop depending artifact, and move as much code as possible to 
> that, keeping the hadoop apis in the old artifact.
> Welcome comments on solving this issue.





[jira] [Created] (PARQUET-1925) Introduce Velocity Template Engine to Parquet Generator

2020-10-15 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1925:
---

 Summary: Introduce Velocity Template Engine to Parquet Generator
 Key: PARQUET-1925
 URL: https://issues.apache.org/jira/browse/PARQUET-1925
 Project: Parquet
  Issue Type: New Feature
Reporter: David Mollitor
Assignee: David Mollitor


Much easier than the current setup of manually outputting the strings.





[jira] [Updated] (PARQUET-1924) Do not Instantiate a New LongHashFunction

2020-10-13 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1924:

Description: 
{code:java|title=XxHash.java}
/**
 * The implementation of HashFunction interface. The XxHash uses XXH64 version 
xxHash
 * with a seed of 0.
 */
public class XxHash implements HashFunction {
  @Override
  public long hashBytes(byte[] input) {
return LongHashFunction.xx(0).hashBytes(input);
  }

  @Override
  public long hashByteBuffer(ByteBuffer input) {
return LongHashFunction.xx(0).hashBytes(input);
  }
{code}

Since the seed is always zero, the {{static}} implementation provided by the 
library can be used here.

  was:
{code:java|title=XxHash.java}
|/**|
| | * The implementation of HashFunction interface. The XxHash uses XXH64 
version xxHash|
| | * with a seed of 0.|
| | */|
| |public class XxHash implements HashFunction {|
| |@Override|
| |public long hashBytes(byte[] input) {|
| |return LongHashFunction.xx(0).hashBytes(input);|
| |}|
| | |
| |@Override|
| |public long hashByteBuffer(ByteBuffer input) {|
| |return LongHashFunction.xx(0).hashBytes(input);|
| |}|


> Do not Instantiate a New LongHashFunction 
> --
>
> Key: PARQUET-1924
> URL: https://issues.apache.org/jira/browse/PARQUET-1924
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> {code:java|title=XxHash.java}
> /**
>  * The implementation of HashFunction interface. The XxHash uses XXH64 
> version xxHash
>  * with a seed of 0.
>  */
> public class XxHash implements HashFunction {
>   @Override
>   public long hashBytes(byte[] input) {
> return LongHashFunction.xx(0).hashBytes(input);
>   }
>   @Override
>   public long hashByteBuffer(ByteBuffer input) {
> return LongHashFunction.xx(0).hashBytes(input);
>   }
> {code}
> Since the seed is always zero, the {{static}} implementation provided by the 
> library can be used here.
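The shape of the fix can be sketched without the OpenHFT dependency (the hash below is an illustrative stand-in, not xxHash; in the real library the no-arg {{LongHashFunction.xx()}} is, to my understanding, a shared seed-0 instance): build the stateless seed-0 function once and reuse it, instead of constructing a new one per call.

```java
import java.util.Arrays;
import java.util.function.ToLongFunction;

public class XxHashSketch {
    // Stand-in for LongHashFunction.xx(seed): any stateless byte[] -> long hash.
    static ToLongFunction<byte[]> seeded(long seed) {
        return input -> {
            long h = seed;
            for (byte b : input) {
                h = h * 1099511628211L + (b & 0xff); // FNV-style mix, illustrative only
            }
            return h;
        };
    }

    // After the fix: one shared instance instead of a fresh seeded(0) per call.
    static final ToLongFunction<byte[]> XX_SEED_0 = seeded(0);

    static long hashBytes(byte[] input) {
        return XX_SEED_0.applyAsLong(input);
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        // Same result as constructing a fresh function per call, minus the allocation.
        assert hashBytes(data) == seeded(0).applyAsLong(data);
        assert hashBytes(data) == hashBytes(Arrays.copyOf(data, data.length));
        System.out.println("ok");
    }
}
```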





[jira] [Created] (PARQUET-1924) Do not Instantiate a New LongHashFunction

2020-10-13 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1924:
---

 Summary: Do not Instantiate a New LongHashFunction 
 Key: PARQUET-1924
 URL: https://issues.apache.org/jira/browse/PARQUET-1924
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


{code:java|title=XxHash.java}
/**
 * The implementation of HashFunction interface. The XxHash uses XXH64 version
 * xxHash with a seed of 0.
 */
public class XxHash implements HashFunction {
  @Override
  public long hashBytes(byte[] input) {
    return LongHashFunction.xx(0).hashBytes(input);
  }

  @Override
  public long hashByteBuffer(ByteBuffer input) {
    return LongHashFunction.xx(0).hashBytes(input);
  }
}
{code}





[jira] [Created] (PARQUET-1922) Deprecate IOExceptionUtils

2020-10-08 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1922:
---

 Summary: Deprecate IOExceptionUtils
 Key: PARQUET-1922
 URL: https://issues.apache.org/jira/browse/PARQUET-1922
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1921) Use StringBuilder instead of StringBuffer

2020-10-08 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1921:
---

 Summary: Use StringBuilder instead of StringBuffer
 Key: PARQUET-1921
 URL: https://issues.apache.org/jira/browse/PARQUET-1921
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: David Mollitor


{code:java|title=MessageTypeParser.java}
private StringBuffer currentLine = new StringBuffer();



public String nextToken() {
  while (st.hasMoreTokens()) {
String t = st.nextToken();
if (t.equals("\n")) {
  ++ line;
  currentLine.setLength(0);
} else {
  currentLine.append(t);
}
if (!isWhitespace(t)) {
  return t;
}
  }
  throw new IllegalArgumentException("unexpected end of schema");
}
{code}

Use {{StringBuilder}} instead of {{StringBuffer}} as {{StringBuffer}} is 
synchronized (which is not required here).
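The change is mechanical, since {{StringBuilder}} exposes the same {{append}}/{{setLength}} API as {{StringBuffer}} but without per-call synchronization; a minimal sketch of the `nextToken()` reset-and-append pattern:

```java
public class BuilderDemo {
    public static void main(String[] args) {
        // Drop-in replacement: same API as StringBuffer, no monitor acquisition.
        StringBuilder currentLine = new StringBuilder();
        for (String t : new String[]{"message", " ", "Person", "\n", "opening"}) {
            if (t.equals("\n")) {
                currentLine.setLength(0);   // reset at end of line, as in nextToken()
            } else {
                currentLine.append(t);
            }
        }
        assert currentLine.toString().equals("opening");
        System.out.println("ok");
    }
}
```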





[jira] [Comment Edited] (PARQUET-1918) Avoid Copy of Bytes in Protobuf BinaryWriter

2020-10-02 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206414#comment-17206414
 ] 

David Mollitor edited comment on PARQUET-1918 at 10/2/20, 7:52 PM:
---

Unit tests fail with the stack trace below.

Trying to address with THRIFT-5288.

{code:java}
java.lang.Exception: java.nio.ReadOnlyBufferException
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.nio.ReadOnlyBufferException
at java.nio.ByteBuffer.array(ByteBuffer.java:996)
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeBinary(TCompactProtocol.java:375)
at 
org.apache.parquet.format.InterningProtocol.writeBinary(InterningProtocol.java:135)
at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:945)
at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:820)
at org.apache.parquet.format.ColumnIndex.write(ColumnIndex.java:728)
at org.apache.parquet.format.Util.write(Util.java:372)
at org.apache.parquet.format.Util.writeColumnIndex(Util.java:69)
at 
org.apache.parquet.hadoop.ParquetFileWriter.serializeColumnIndexes(ParquetFileWriter.java:1087)
at 
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1050)
 {code}


was (Author: belugabehr):
Unit tests fail with:

 
{code:java}
java.lang.Exception: java.nio.ReadOnlyBufferException
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.nio.ReadOnlyBufferException
at java.nio.ByteBuffer.array(ByteBuffer.java:996)
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeBinary(TCompactProtocol.java:375)
at 
org.apache.parquet.format.InterningProtocol.writeBinary(InterningProtocol.java:135)
at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:945)
at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:820)
at org.apache.parquet.format.ColumnIndex.write(ColumnIndex.java:728)
at org.apache.parquet.format.Util.write(Util.java:372)
at org.apache.parquet.format.Util.writeColumnIndex(Util.java:69)
at 
org.apache.parquet.hadoop.ParquetFileWriter.serializeColumnIndexes(ParquetFileWriter.java:1087)
at 
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1050)
 {code}

> Avoid Copy of Bytes in Protobuf BinaryWriter
> 
>
> Key: PARQUET-1918
> URL: https://issues.apache.org/jira/browse/PARQUET-1918
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> {code:java|title=ProtoWriteSupport.java}
>   class BinaryWriter extends FieldWriter {
> @Override
> final void writeRawValue(Object value) {
>   ByteString byteString = (ByteString) value;
>   Binary binary = Binary.fromConstantByteArray(byteString.toByteArray());
>   recordConsumer.addBinary(binary);
> }
>   }
> {code}
> {{toByteArray()}} creates a copy of the buffer.  There is already support 
> with Parquet and Protobuf to pass instead a ByteBuffer which avoids the copy.
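The copy-vs-wrap distinction can be shown with plain java.nio (no protobuf here; to my understanding {{ByteString.asReadOnlyByteBuffer()}} plays the role of the wrapping view, feeding something like {{Binary.fromConstantByteBuffer()}}):

```java
import java.nio.ByteBuffer;

public class WrapVsCopy {
    public static void main(String[] args) {
        byte[] backing = {1, 2, 3, 4};

        // Copy: a second array is allocated, analogous to ByteString.toByteArray().
        byte[] copy = backing.clone();

        // Wrap: a read-only view over the same memory, no allocation of the payload.
        ByteBuffer view = ByteBuffer.wrap(backing).asReadOnlyBuffer();

        backing[0] = 9;              // mutate the original
        assert copy[0] == 1;         // the copy is detached from the backing array
        assert view.get(0) == 9;     // the view still sees the backing array
        System.out.println("ok");
    }
}
```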





[jira] [Commented] (PARQUET-1918) Avoid Copy of Bytes in Protobuf BinaryWriter

2020-10-02 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17206414#comment-17206414
 ] 

David Mollitor commented on PARQUET-1918:
-

Unit tests fail with:

{code:java}
java.lang.Exception: java.nio.ReadOnlyBufferException
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.nio.ReadOnlyBufferException
at java.nio.ByteBuffer.array(ByteBuffer.java:996)
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.writeBinary(TCompactProtocol.java:375)
at 
org.apache.parquet.format.InterningProtocol.writeBinary(InterningProtocol.java:135)
at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:945)
at 
org.apache.parquet.format.ColumnIndex$ColumnIndexStandardScheme.write(ColumnIndex.java:820)
at org.apache.parquet.format.ColumnIndex.write(ColumnIndex.java:728)
at org.apache.parquet.format.Util.write(Util.java:372)
at org.apache.parquet.format.Util.writeColumnIndex(Util.java:69)
at 
org.apache.parquet.hadoop.ParquetFileWriter.serializeColumnIndexes(ParquetFileWriter.java:1087)
at 
org.apache.parquet.hadoop.ParquetFileWriter.end(ParquetFileWriter.java:1050)
 {code}

> Avoid Copy of Bytes in Protobuf BinaryWriter
> 
>
> Key: PARQUET-1918
> URL: https://issues.apache.org/jira/browse/PARQUET-1918
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> {code:java|title=ProtoWriteSupport.java}
>   class BinaryWriter extends FieldWriter {
> @Override
> final void writeRawValue(Object value) {
>   ByteString byteString = (ByteString) value;
>   Binary binary = Binary.fromConstantByteArray(byteString.toByteArray());
>   recordConsumer.addBinary(binary);
> }
>   }
> {code}
> {{toByteArray()}} creates a copy of the buffer.  There is already support 
> with Parquet and Protobuf to pass instead a ByteBuffer which avoids the copy.





[jira] [Moved] (PARQUET-1918) Avoid Copy of Bytes in Protobuf BinaryWriter

2020-10-02 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor moved HIVE-24226 to PARQUET-1918:


 Key: PARQUET-1918  (was: HIVE-24226)
Workflow: patch-available, re-open possible  (was: no-reopen-closed, 
patch-avail)
 Project: Parquet  (was: Hive)

> Avoid Copy of Bytes in Protobuf BinaryWriter
> 
>
> Key: PARQUET-1918
> URL: https://issues.apache.org/jira/browse/PARQUET-1918
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> {code:java|title=ProtoWriteSupport.java}
>   class BinaryWriter extends FieldWriter {
> @Override
> final void writeRawValue(Object value) {
>   ByteString byteString = (ByteString) value;
>   Binary binary = Binary.fromConstantByteArray(byteString.toByteArray());
>   recordConsumer.addBinary(binary);
> }
>   }
> {code}
> {{toByteArray()}} creates a copy of the buffer.  There is already support 
> with Parquet and Protobuf to pass instead a ByteBuffer which avoids the copy.





[jira] [Commented] (PARQUET-1914) Allow ProtoParquetReader To Support InputFile

2020-09-21 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17199592#comment-17199592
 ] 

David Mollitor commented on PARQUET-1914:
-

{{ProtoParquetReader.Builder}} should extend {{ParquetReader.Builder}} to 
correctly override {{getReadSupport()}} and to allow for using an {{InputFile}} 
in addition to the previously supported {{Path}}.

The usage pattern here is a bit confusing and I wanted to update the 
{{ParquetReader.Builder}} directly, but I think this is the way it is intended.

> Allow ProtoParquetReader To Support InputFile
> -
>
> Key: PARQUET-1914
> URL: https://issues.apache.org/jira/browse/PARQUET-1914
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Major
>






[jira] [Resolved] (PARQUET-1913) ParquetReader Should Support InputFile

2020-09-21 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor resolved PARQUET-1913.
-
Resolution: Won't Fix

> ParquetReader Should Support InputFile
> --
>
> Key: PARQUET-1913
> URL: https://issues.apache.org/jira/browse/PARQUET-1913
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Major
>
> When creating a {{ParquetReader}}, a "read support" object is required.
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L325-L330
> However, when building from an {{InputFile}}, 'readSupport' is always 'null' 
> and therefore will never work.
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L202
> Add the read support option just as is done with a {{Path}} object.





[jira] [Created] (PARQUET-1914) Allow ProtoParquetReader To Support InputFile

2020-09-21 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1914:
---

 Summary: Allow ProtoParquetReader To Support InputFile
 Key: PARQUET-1914
 URL: https://issues.apache.org/jira/browse/PARQUET-1914
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1913) ParquetReader Should Support InputFile

2020-09-21 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1913:
---

 Summary: ParquetReader Should Support InputFile
 Key: PARQUET-1913
 URL: https://issues.apache.org/jira/browse/PARQUET-1913
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


When creating a {{ParquetReader}}, a "read support" object is required.

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L325-L330

However, when building from an {{InputFile}}, 'readSupport' is always 'null' 
and therefore will never work.

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L202

Add the read support option just as is done with a {{Path}} object.
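A stdlib-only sketch of the builder shape being asked for (the tiny {{InputFile}}/{{ReadSupport}} types below are hypothetical stand-ins for the parquet-mr classes): the {{InputFile}} entry point carries a read support, just as the {{Path}} entry point does, so {{build()}} no longer dies on a null.

```java
public class BuilderShape {
    interface InputFile { String name(); }

    static final class ReadSupport {
        final String kind;
        ReadSupport(String kind) { this.kind = kind; }
    }

    // Stand-in for ParquetReader.Builder: the InputFile path now keeps a
    // readSupport instead of leaving it null.
    static final class Builder {
        private final String source;
        private ReadSupport readSupport;

        static Builder fromFile(InputFile file, ReadSupport rs) {
            Builder b = new Builder(file.name());
            b.readSupport = rs;   // the piece the issue says is missing
            return b;
        }

        private Builder(String source) { this.source = source; }

        String build() {
            if (readSupport == null) {
                throw new IllegalStateException("readSupport is null and will never work");
            }
            return source + " via " + readSupport.kind;
        }
    }

    public static void main(String[] args) {
        InputFile f = () -> "data.parquet";
        assert Builder.fromFile(f, new ReadSupport("avro")).build()
                .equals("data.parquet via avro");
        System.out.println("ok");
    }
}
```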





[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies

2020-09-01 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188573#comment-17188573
 ] 

David Mollitor commented on PARQUET-1822:
-

Parquet 2.0 anyone?

> Parquet without Hadoop dependencies
> ---
>
> Key: PARQUET-1822
> URL: https://issues.apache.org/jira/browse/PARQUET-1822
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Amazon Fargate (linux), Windows development box.
> We are writing Parquet to be read by the Snowflake and Athena databases.
>Reporter: mark juchems
>Priority: Minor
>  Labels: documentation, newbie
>
> I have been trying for weeks to create a parquet file from avro and write to 
> S3 in Java.  This has been incredibly frustrating and odd as Spark can do it 
> easily (I'm told).
> I have assembled the correct jars through luck and diligence, but now I find 
> out that I have to have Hadoop installed on my machine.  I am currently 
> developing on Windows and it seems a dll and an exe can fix that up, but am 
> wondering about Linux as the code will eventually run in Fargate on AWS.
> *Why do I need external dependencies and not pure java?*
> The thing really is how utterly complex all this is.  I would like to create 
> an avro file and convert it to Parquet and write it to S3, but I am trapped 
> in "ParquetWriter" hell! 
> *Why can't I get a normal OutputStream and write it wherever I want?*
> I have scoured the web for examples and there are a few but we really need 
> some documentation on this stuff.  I understand that there may be reasons for 
> all this but I can't find them on the web anywhere.  Any help?  Can't we get 
> the "SimpleParquet" jar that does this:
>  
> ParquetWriter writer = 
> AvroParquetWriter.builder(outputStream)
>  .withSchema(avroSchema)
>  .withConf(conf)
>  .withCompressionCodec(CompressionCodecName.SNAPPY)
>  .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites 
> files).
>  .build();
>  





[jira] [Commented] (PARQUET-1905) Use SeekableByteChannel instead of OutputFile/InputFile Classes

2020-08-27 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186108#comment-17186108
 ] 

David Mollitor commented on PARQUET-1905:
-

Also gets rid of {{PositionOutputStream}}

> Use SeekableByteChannel instead of OutputFile/InputFile Classes
> ---
>
> Key: PARQUET-1905
> URL: https://issues.apache.org/jira/browse/PARQUET-1905
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
> Fix For: 2.0.0
>
>
> Use Java NIO SeekableByteChannel for input to reader/writer instead of the 
> current Parquet-only {{Output}}/{{InputFile}} Classes





[jira] [Created] (PARQUET-1905) Use SeekableByteChannel instead of OutputFile/InputFile Classes

2020-08-27 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1905:
---

 Summary: Use SeekableByteChannel instead of OutputFile/InputFile 
Classes
 Key: PARQUET-1905
 URL: https://issues.apache.org/jira/browse/PARQUET-1905
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
 Fix For: 2.0.0


Use Java NIO SeekableByteChannel instead of the current Parquet-only 
{{Output}}/{{InputFile}} Classes





[jira] [Updated] (PARQUET-1905) Use SeekableByteChannel instead of OutputFile/InputFile Classes

2020-08-27 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1905:

Description: Use Java NIO SeekableByteChannel for input to reader/writer 
instead of the current Parquet-only {{Output}}/{{InputFile}} Classes  (was: Use 
Java NIO SeekableByteChannel instead of the current Parquet-only 
{{Output}}/{{InputFile}} Classes)

> Use SeekableByteChannel instead of OutputFile/InputFile Classes
> ---
>
> Key: PARQUET-1905
> URL: https://issues.apache.org/jira/browse/PARQUET-1905
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
> Fix For: 2.0.0
>
>
> Use Java NIO SeekableByteChannel for input to reader/writer instead of the 
> current Parquet-only {{Output}}/{{InputFile}} Classes





[jira] [Updated] (PARQUET-1903) Improve Parquet Protobuf Usability

2020-08-27 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1903:

Description: 
Check out the PR for details.

 
 * Move away from passing around a {{Class}} object to take advantage of Java 
Templating
 * Make parquet-proto library more usable and straight-forward
 * Provide test examples
 * Limited support for protocol buffer schema registry

 

  was:
Check out the PR for details.

 
 * Move away from passing around a {{Class}} object to take advantage of Java 
Templating
 * Make parquet-proto library more usable and straight-forward
 * Provide test examples

 


> Improve Parquet Protobuf Usability
> --
>
> Key: PARQUET-1903
> URL: https://issues.apache.org/jira/browse/PARQUET-1903
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Major
>
> Check out the PR for details.
>  
>  * Move away from passing around a {{Class}} object to take advantage of Java 
> Templating
>  * Make parquet-proto library more usable and straight-forward
>  * Provide test examples
>  * Limited support for protocol buffer schema registry
>  





[jira] [Created] (PARQUET-1903) Improve Parquet Protobuf Usability

2020-08-26 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1903:
---

 Summary: Improve Parquet Protobuf Usability
 Key: PARQUET-1903
 URL: https://issues.apache.org/jira/browse/PARQUET-1903
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


Check out the PR for details.

 
 * Move away from passing around a {{Class}} object to take advantage of Java 
Templating
 * Make parquet-proto library more usable and straight-forward
 * Provide test examples

 





Re: Finding Max Value of Column

2020-03-10 Thread David Mollitor
Hey Gabor,

I appreciate you sharing your knowledge with me.

As I understand it, my solution is acceptable but is not the generalized
solution.  What would that solution look like?

Thanks.

On Tue, Mar 10, 2020, 4:55 AM Gabor Szadovszky
 wrote:

> Hi,
>
> Statistics objects are mainly created for internal use. The check you
> mentioned is to ensure that only the corresponding column statistics are
> summarized.
> The code you've written works properly because you create and use the
> Statistics object as we use it internally. However, it is quite easy to
> misuse it.
> It is also worth mentioning that the code works properly because your type
> is an INT64. In case of some other types (e.g. FLOAT, DOUBLE, BINARY) it
> would not always be that trivial.
> So, if this code works for your case you may use it but I would not suggest
> generalizing it for other cases and neither would suggest extending the
> existing code to support it.
>
> Regards,
> Gabor
>
> On Mon, Mar 9, 2020 at 4:12 PM David Mollitor  wrote:
>
> > Hello,
> >
> > One thing that would have made this even easier... the 'mergeStatistics'
> > method throws an exception if the columns are not equal on the RHS/LHS of
> > the method.  I had to add that toDotString check to avoid this
> scenario.  I
> > could have just caught (and ignored) that exception to remove that extra
> > check, but the overhead would have been heavy, and it would have added
> even
> > more code.
> >
> > The 'mergeStatistics' method is already doing a comparison check
> internally
> > (that's why it throws an exception),  is there any interest in adding a
> new
> > method signature that returns true/false if the merge was successful,
> > instead of throwing an exception?
> >
> > Then the code just becomes:
> >
> > for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> >   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > boolean success =
> > stats.mergeStatistics(column.getStatistics());
> >   }
> > }
> >
> >
> >
> > On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky
> >  wrote:
> >
> > > Hi David,
> > >
> > > Your code looks good to me. As you are using INT64, min/max truncate
> does
> > > not apply. I think, it should work fine.
> > >
> > > Cheers,
> > > Gabor
> > >
> > > On Mon, Mar 9, 2020 at 3:42 PM David Mollitor 
> wrote:
> > >
> > > > Hello Gang,
> > > >
> > > > I am trying to build an application.  One function it has is to scan
> a
> > > > directory of Parquet files and then determine the maximum "sequence
> > > number"
> > > > (id) across all files.  This is the solution I came up with, but is
> > this
> > > > correct?  How would you do such a thing?
> > > >
> > > > I wrote the files with parquet-avro writer.
> > > >
> > > > try (DirectoryStream<java.nio.file.Path> directoryStream =
> > > > Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
> > > >
> > > >   PrimitiveType type =
> > > > Types.required(PrimitiveTypeName.INT64).named("seq");
> > > >   Statistics<?> stats =
> > > > Statistics.getBuilderForReading(type).build();
> > > >
> > > >   for (java.nio.file.Path path : directoryStream) {
> > > > ParquetFileReader reader =
> > > > ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path.toUri()),
> > > > new Configuration()));
> > > >
> > > > for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> > > >   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > > > if ("seq".equals(column.getPath().toDotString())) {
> > > >   stats.mergeStatistics(column.getStatistics());
> > > > }
> > > >   }
> > > >}
> > > > }
> > > >
> > > > Thanks.
> > > >
> > >
> >
>


Re: Finding Max Value of Column

2020-03-09 Thread David Mollitor
Hello,

One thing that would have made this even easier... the 'mergeStatistics'
method throws an exception if the columns are not equal on the RHS/LHS of
the method.  I had to add that toDotString check to avoid this scenario.  I
could have just caught (and ignored) that exception to remove that extra
check, but the overhead would have been heavy, and it would have added even
more code.

The 'mergeStatistics' method is already doing a comparison check internally
(that's why it throws an exception),  is there any interest in adding a new
method signature that returns true/false if the merge was successful,
instead of throwing an exception?

Then the code just becomes:

for (final BlockMetaData rowGroup : reader.getRowGroups()) {
  for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
boolean success = stats.mergeStatistics(column.getStatistics());
  }
}
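
A minimal, self-contained sketch of the proposed non-throwing variant (hypothetical names; the `Stats` class below is a stand-in, not Parquet's real `Statistics`):

```java
// Hypothetical sketch of the proposed API change: mergeStatistics returns
// a boolean instead of throwing when the columns do not match.
public class StatsMergeSketch {

    static final class Stats {
        final String columnPath;  // dotted column path, e.g. "seq"
        long max;                 // running maximum for that column

        Stats(String columnPath, long max) {
            this.columnPath = columnPath;
            this.max = max;
        }

        // Proposed behavior: the merge applies only when the columns match;
        // a mismatch reports false rather than raising an exception.
        boolean mergeStatistics(Stats other) {
            if (!columnPath.equals(other.columnPath)) {
                return false;
            }
            max = Math.max(max, other.max);
            return true;
        }
    }

    public static void main(String[] args) {
        Stats acc = new Stats("seq", 10L);
        boolean merged = acc.mergeStatistics(new Stats("seq", 42L));
        boolean skipped = acc.mergeStatistics(new Stats("other", 99L));
        // Prints: true false 42
        System.out.println(merged + " " + skipped + " " + acc.max);
    }
}
```

With such a signature the caller decides whether a mismatched column is an error or simply a chunk to skip, avoiding both the extra toDotString check and the exception overhead.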



On Mon, Mar 9, 2020 at 10:58 AM Gabor Szadovszky
 wrote:

> Hi David,
>
> Your code looks good to me. As you are using INT64, min/max truncate does
> not apply. I think, it should work fine.
>
> Cheers,
> Gabor
>
> On Mon, Mar 9, 2020 at 3:42 PM David Mollitor  wrote:
>
> > Hello Gang,
> >
> > I am trying to build an application.  One function it has is to scan a
> > directory of Parquet files and then determine the maximum "sequence
> > number" (id) across all files.  This is the solution I came up with, but
> > is this correct?  How would you do such a thing?
> >
> > I wrote the files with parquet-avro writer.
> >
> > try (DirectoryStream<java.nio.file.Path> directoryStream =
> > Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {
> >
> >   PrimitiveType type =
> > Types.required(PrimitiveTypeName.INT64).named("seq");
> >   Statistics<?> stats = Statistics.getBuilderForReading(type).build();
> >
> >   for (java.nio.file.Path path : directoryStream) {
> > ParquetFileReader reader =
> > ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path.toUri()),
> > new Configuration()));
> >
> > for (final BlockMetaData rowGroup : reader.getRowGroups()) {
> >   for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
> > if ("seq".equals(column.getPath().toDotString())) {
> >   stats.mergeStatistics(column.getStatistics());
> > }
> >   }
> >}
> > }
> >
> > Thanks.
> >
>


Finding Max Value of Column

2020-03-09 Thread David Mollitor
Hello Gang,

I am trying to build an application.  One function it has is to scan a
directory of Parquet files and then determine the maximum "sequence number"
(id) across all files.  This is the solution I came up with, but is this
correct?  How would you do such a thing?

I wrote the files with parquet-avro writer.

try (DirectoryStream<java.nio.file.Path> directoryStream =
    Files.newDirectoryStream(Paths.get("/tmp/parq-files"), filter)) {

  PrimitiveType type = Types.required(PrimitiveTypeName.INT64).named("seq");
  Statistics<?> stats = Statistics.getBuilderForReading(type).build();

  for (java.nio.file.Path path : directoryStream) {
    ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path.toUri()),
            new Configuration()));

    for (final BlockMetaData rowGroup : reader.getRowGroups()) {
      for (final ColumnChunkMetaData column : rowGroup.getColumns()) {
        if ("seq".equals(column.getPath().toDotString())) {
          stats.mergeStatistics(column.getStatistics());
        }
      }
    }
  }
}

Thanks.


[jira] [Created] (PARQUET-1782) Use Switch Statement in AvroRecordConverter

2020-02-03 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1782:
---

 Summary: Use Switch Statement in AvroRecordConverter
 Key: PARQUET-1782
 URL: https://issues.apache.org/jira/browse/PARQUET-1782
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader

2020-02-01 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1778:

Comment: was deleted

(was: I think this is an Avro issue.)

> Do Not Consider Class for Avro Generic Record Reader
> 
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
> ~[avro-1.9.1.jar:1.9.1]
>   at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404)
>  ~[parquet-avro-1.11.0.jar:1.11.0]
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392)
>  ~[parquet-column-1.11.0.jar:1.11.0]
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
>  ~[parquet-hadoop-1.11.0.jar:1.11.0]
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
> ~[parquet-hadoop-1.11.0.jar:1.11.0]
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
> ~[parquet-hadoop-1.11.0.jar:1.11.0]
> {code}
> I was surprised because it should just load a {{GenericRecord}} view of the 
> data. But alas, I have the Avro Schema defined with the {{namespace}} and 
> {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
> happens to be a real class on the class path, so it is trying to call the 
> public constructor on the class and this constructor does not exist.  
> Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
> namespace information.
> I am putting {{GenericRecords}} into the Parquet file, I expect to get 
> {{GenericRecords}} back out when I read it.
> If I hack the information in a Schema and change the {{namespace}} or 
> {{name}} fields to something bogus, it works as I would expect it to.  It 
> successfully reads and returns a {{GenericRecord}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader

2020-02-01 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor reassigned PARQUET-1778:
---

Assignee: David Mollitor

> Do Not Consider Class for Avro Generic Record Reader
> 
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
> ~[avro-1.9.1.jar:1.9.1]
>   at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404)
>  ~[parquet-avro-1.11.0.jar:1.11.0]
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392)
>  ~[parquet-column-1.11.0.jar:1.11.0]
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
>  ~[parquet-hadoop-1.11.0.jar:1.11.0]
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
> ~[parquet-hadoop-1.11.0.jar:1.11.0]
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
> ~[parquet-hadoop-1.11.0.jar:1.11.0]
> {code}
> I was surprised because it should just load a {{GenericRecord}} view of the 
> data. But alas, I have the Avro Schema defined with the {{namespace}} and 
> {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
> happens to be a real class on the class path, so it is trying to call the 
> public constructor on the class and this constructor does not exist.  
> Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
> namespace information.
> I am putting {{GenericRecords}} into the Parquet file, I expect to get 
> {{GenericRecords}} back out when I read it.
> If I hack the information in a Schema and change the {{namespace}} or 
> {{name}} fields to something bogus, it works as I would expect it to.  It 
> successfully reads and returns a {{GenericRecord}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader

2020-02-01 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17028170#comment-17028170
 ] 

David Mollitor commented on PARQUET-1778:
-

I think this is an Avro issue.

> Do Not Consider Class for Avro Generic Record Reader
> 
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
> ~[avro-1.9.1.jar:1.9.1]
>   at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404)
>  ~[parquet-avro-1.11.0.jar:1.11.0]
>   at 
> org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392)
>  ~[parquet-column-1.11.0.jar:1.11.0]
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
>  ~[parquet-hadoop-1.11.0.jar:1.11.0]
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
> ~[parquet-hadoop-1.11.0.jar:1.11.0]
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
> ~[parquet-hadoop-1.11.0.jar:1.11.0]
> {code}
> I was surprised because it should just load a {{GenericRecord}} view of the 
> data. But alas, I have the Avro Schema defined with the {{namespace}} and 
> {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
> happens to be a real class on the class path, so it is trying to call the 
> public constructor on the class and this constructor does not exist.  
> Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
> namespace information.
> I am putting {{GenericRecords}} into the Parquet file, I expect to get 
> {{GenericRecords}} back out when I read it.
> If I hack the information in a Schema and change the {{namespace}} or 
> {{name}} fields to something bogus, it works as I would expect it to.  It 
> successfully reads and returns a {{GenericRecord}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
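
The stack trace in the issue above bottoms out in `Class.getDeclaredConstructor`; a self-contained illustration of that reflective failure (an analogy, not Parquet or Avro code) is:

```java
// Illustrates the reflective lookup that fails in the report above:
// Avro's SpecificData resolves the schema's namespace+name to a class and
// asks it for a no-arg constructor.  AppRecord below stands in for the
// reporter's io.github.belugabehr.app.Record, which lacks such a constructor.
public class NoArgCtorDemo {

    static final class AppRecord {
        private final String payload;

        AppRecord(String payload) {  // the only constructor takes an argument
            this.payload = payload;
        }
    }

    public static void main(String[] args) {
        try {
            AppRecord.class.getDeclaredConstructor().newInstance();
            System.out.println("constructed");
        } catch (ReflectiveOperationException e) {
            // Same failure mode as the reported NoSuchMethodException
            System.out.println(e.getClass().getSimpleName());
        }
    }
}
```

Forcing a generic data model on the reader (for example via the parquet-avro builder's `withDataModel(GenericData.get())`, assuming that builder option is available in the version in use) sidesteps the class lookup entirely.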


Re: Parquet Verbose Logging

2020-01-24 Thread David Mollitor
Hey Ryan,

I think you understand my position correctly and articulated it well.  My
background is from higher up the stack; a consumer of these libraries.

We may need to agree to disagree on this one.  Projects these days include
100+ libraries and I don't want to have to set a custom log level for each
one.  Much easier for consumers of libraries to keep everything as quiet as
possible and then only have to worry about a custom logging level when
something goes wrong.  Parquet in particular logs a lot of stuff at INFO
level that is very specific to Parquet and would only be useful (if at all)
to someone that really knows the library, not something that would be
helpful to the higher level application developer.

Thanks.
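
From the consuming application's side, quieting the library is a logger-configuration concern rather than a code change; one way to do it, assuming Logback as the SLF4J backend (any backend has an equivalent):

```xml
<!-- logback.xml sketch (assumes Logback; adapt for log4j etc.):
     the application stays at INFO while org.apache.parquet is capped at WARN -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Only WARN and above from Parquet classes -->
  <logger name="org.apache.parquet" level="WARN"/>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```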



On Fri, Jan 24, 2020 at 6:48 PM Ryan Blue  wrote:

> It sounds like we see logging differently. My approach is that for any
> library, the type of information should be categorized using the same
> criteria into log levels. For example, if it is a normal event you might
> want to know about, use info. It looks like your approach is that the
> levels should be set for information from the perspective of the end
> application: is this behavior relevant to the end user?
>
> The problem is that you don't always know whether something is relevant to
> the end user because that context depends on the application. For the
> Parquet CLI, much more Parquet information is relevant than for Presto that
> is scanning Parquet files. That's why I think it's best to categorize the
> log information using a standard definition, and rely on the end
> application to configure log levels for its users expectations.
>
> On Fri, Jan 24, 2020 at 10:29 AM David Mollitor  wrote:
>
>> Hello Ryan,
>>
>> I appreciate you taking the time to share your thoughts.
>>
>> I'd just like to point out that there is also TRACE level logging if
>> Parquet requires greater granularity.
>>
>> Furthermore, I'm not suggesting that there be an unbreakable rule that
>> all logging must be DEBUG, but it should be the exception, not the rule.
>> It is more likely the situation that the wrapping application would be
>> responsible for logging at the INFO and WARN/ERROR level.  Something
>> like
>>
>> try {
>>LOG.info("Using Parquet to read file {}", path);
>>avroParquetReader.read();
>> } catch (Exception e) {
>>   LOG.error("Failed to read Parquet file", e);
>> }
>>
>> This is a very normal setup and doesn't require any additional logging
>> from the Parquet library itself.  Once I see an error with "Failed to re
>> Parquet file", then I'm going to turn on DEBUG logging and try to reproduce
>> the error.
>>
>> Thanks,
>> David
>>
>> On Fri, Jan 24, 2020 at 12:01 PM Ryan Blue 
>> wrote:
>>
>>> I don't agree with the idea to convert all of Parquet's logs to DEBUG
>>> level, but I do think that we can improve the levels of individual
>>> messages.
>>>
>>> If we convert all logs to debug, then turning on logs to see what Parquet
>>> is doing would show everything from opening an input file to position
>>> tracking in output files. That's way too much information, which is why
>>> we
>>> use different log levels to begin with.
>>>
>>> I think we should continue using log levels to distinguish between types
>>> of
>>> information: error for errors, warn for recoverable errors that may or
>>> may
>>> not indicate a problem, info for regular operations, and debug for extra
>>> information if you're debugging the Parquet library. Following the common
>>> convention enables people to choose what information they want instead of
>>> mixing it all together.
>>>
>>> If you want to only see error and warning logs from Parquet, then the
>>> right
>>> way to do that is to configure your logger so that the level for
>>> org.apache.parquet classes is warn. That's not to say I don't agree that
>>> we
>>> can cut down on what is logged at info and clean it up; I just don't
>>> think
>>> it's a good idea to abandon the idea of log levels to distinguish between
>>> different information the user of a library will need.
>>>
>>> On Fri, Jan 24, 2020 at 6:30 AM lukas nalezenec 
>>> wrote:
>>>
>>> > Hi,
>>> > I can help too.
>>> > Lukas
>>> >
>>> > On Fri, Jan 24, 2020 at 15:29, David Mollitor
>>> > wrote:
>>> >
>>> > > Hello Team,
>>> > >
>>> > > I am happy to do the w

[jira] [Updated] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1778:

Description: 
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) 
~[parquet-avro-1.11.0.jar:1.11.0]
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392)
 ~[parquet-column-1.11.0.jar:1.11.0]
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 ~[parquet-hadoop-1.11.0.jar:1.11.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
~[parquet-hadoop-1.11.0.jar:1.11.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
~[parquet-hadoop-1.11.0.jar:1.11.0]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class and this constructor does not exist.  
Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
namespace information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.

If I hack the information in a Schema and change the {{namespace}} or {{name}} 
fields to something bogus, it works as I would expect it to.  It successfully 
reads and returns a {{GenericRecord}}.

  was:
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) 
~[parquet-avro-1.11.0.jar:1.11.0]
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392)
 ~[parquet-column-1.11.0.jar:1.11.0]
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 ~[parquet-hadoop-1.11.0.jar:1.11.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
~[parquet-hadoop-1.11.0.jar:1.11.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
~[parquet-hadoop-1.11.0.jar:1.11.0]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class and this constructor does not exist.  
Regardless, the {{GenericRecordReader}} should just

[jira] [Updated] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1778:

Description: 
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData.newInstance(SpecificData.java:470) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData.newRecord(SpecificData.java:491) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.parquet.avro.AvroRecordConverter.start(AvroRecordConverter.java:404) 
~[parquet-avro-1.11.0.jar:1.11.0]
at 
org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:392)
 ~[parquet-column-1.11.0.jar:1.11.0]
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
 ~[parquet-hadoop-1.11.0.jar:1.11.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132) 
~[parquet-hadoop-1.11.0.jar:1.11.0]
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136) 
~[parquet-hadoop-1.11.0.jar:1.11.0]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class and this constructor does not exist.  
Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
namespace information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.

  was:
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class and this constructor does not exist.  
Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
namespace information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.


> Do Not Consider Class for Avro Generic Record Reader
> 
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.Spe

[jira] [Updated] (PARQUET-1778) Do Not Consider Class for Avro Generic Record Reader

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1778:

Summary: Do Not Consider Class for Avro Generic Record Reader  (was: Do Not 
Record Class for Avro Generic Record Reader)

> Do Not Consider Class for Avro Generic Record Reader
> 
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
> ~[avro-1.9.1.jar:1.9.1]
>   at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
> {code}
> I was surprised because it should just load a {{GenericRecord}} view of the 
> data. But alas, I have the Avro Schema defined with the {{namespace}} and 
> {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
> happens to be a real class on the class path, so it is trying to call the 
> public constructor on the class and this constructor does not exist.  
> Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
> namespace information.
> I am putting {{GenericRecords}} into the Parquet file, I expect to get 
> {{GenericRecords}} back out when I read it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1778) Do Not Record Class for Avro Generic Record Reader

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1778:

Description: 
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class, and this constructor does not exist.  
Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
namespace information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.
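One way to sidestep the class lookup in the meantime, assuming parquet-avro's builder API, is to force the generic data model explicitly (an untested sketch, not part of the original report):

{code:java|title=Workaround Sketch}
// Force GenericData so SpecificData never tries to resolve
// io.github.belugabehr.app.Record from the class path.
final ParquetReader<GenericRecord> reader = AvroParquetReader
    .<GenericRecord>builder(path)
    .withDataModel(GenericData.get())
    .build();
final GenericRecord genericRecord = reader.read();
{code}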

  was:
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class which does not exist.

The {{GenericRecordReader}} should always ignore this Avro Schema namespace 
information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.


> Do Not Record Class for Avro Generic Record Reader
> --
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.<GenericRecord>builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
> ~[avro-1.9.1.jar:1.9.1]
>   at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
> {code}
> I was surprised because it should just load a {{GenericRecord}} view of the 
> data. But alas, I have the Avro Schema defined with the {{namespace}} and 
> {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
> happens to be a real class on the class path, so it is trying to call the 
> public constructor on the class, and this constructor does not exist.  
> Regardless, the {{GenericRecordReader}} should just ignore this Avro Schema 
> namespace information.

[jira] [Updated] (PARQUET-1778) Do Not Record Class for Avro Generic Record Reader

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1778:

Description: 
 
{code:java|title=Example Code}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}
It fails with...
{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}
I was surprised because it should just load a {{GenericRecord}} view of the 
data. But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class which does not exist.

The {{GenericRecordReader}} should always ignore this Avro Schema namespace 
information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.

  was:
{code:java}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}

It fails with...

{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}

I was surprised because it should just load a {{GenericRecord}} view of the 
data.  But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class which does not exist.

The {{GenericRecordReader}} should always ignore this Avro Schema namespace 
information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.


> Do Not Record Class for Avro Generic Record Reader
> --
>
> Key: PARQUET-1778
> URL: https://issues.apache.org/jira/browse/PARQUET-1778
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Major
>
>  
> {code:java|title=Example Code}
> final ParquetReader<GenericRecord> reader = 
> AvroParquetReader.<GenericRecord>builder(path).build();
> final GenericRecord genericRecord = reader.read();
> {code}
> It fails with...
> {code:none}
> java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
>   at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
>   at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
> ~[na:1.8.0_232]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
> ~[avro-1.9.1.jar:1.9.1]
>   at 
> org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
> ~[avro-1.9.1.jar:1.9.1]
>   at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
> ~[na:1.8.0_232]
>   at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
> {code}
> I was surprised because it should just load a {{GenericRecord}} view of the 
> data. But alas, I have the Avro Schema defined with the {{namespace}} and 
> {{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
> happens to be a real class on the class path, so it is trying to call the 
> public constructor on the class which does not exist.
> The {{GenericRecordReader}} should always ignore this Avro Schema namespace 
> information.

[jira] [Created] (PARQUET-1778) Do Not Record Class for Avro Generic Record Reader

2020-01-24 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1778:
---

 Summary: Do Not Record Class for Avro Generic Record Reader
 Key: PARQUET-1778
 URL: https://issues.apache.org/jira/browse/PARQUET-1778
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor


{code:java}
final ParquetReader<GenericRecord> reader = 
AvroParquetReader.<GenericRecord>builder(path).build();
final GenericRecord genericRecord = reader.read();
{code}

It fails with...

{code:none}
java.lang.NoSuchMethodException: io.github.belugabehr.app.Record.<init>()
at java.lang.Class.getConstructor0(Class.java:3082) ~[na:1.8.0_232]
at java.lang.Class.getDeclaredConstructor(Class.java:2178) 
~[na:1.8.0_232]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:63) 
~[avro-1.9.1.jar:1.9.1]
at 
org.apache.avro.specific.SpecificData$1.computeValue(SpecificData.java:58) 
~[avro-1.9.1.jar:1.9.1]
at java.lang.ClassValue.getFromHashMap(ClassValue.java:227) 
~[na:1.8.0_232]
at java.lang.ClassValue.getFromBackup(ClassValue.java:209) 
~[na:1.8.0_232]
at java.lang.ClassValue.get(ClassValue.java:115) ~[na:1.8.0_232]
{code}

I was surprised because it should just load a {{GenericRecord}} view of the 
data.  But alas, I have the Avro Schema defined with the {{namespace}} and 
{{name}} fields pointing to {{io.github.belugabehr.app.Record}} which just so 
happens to be a real class on the class path, so it is trying to call the 
public constructor on the class which does not exist.

The {{GenericRecordReader}} should always ignore this Avro Schema namespace 
information.

I am putting {{GenericRecords}} into the Parquet file, I expect to get 
{{GenericRecords}} back out when I read it.





Re: Parquet Verbose Logging

2020-01-24 Thread David Mollitor
Hello Ryan,

I appreciate you taking the time to share your thoughts.

I'd just like to point out that there is also TRACE level logging if
Parquet requires greater granularity.

Furthermore, I'm not suggesting that there be an unbreakable rule that all
logging must be DEBUG, but it should be the exception, not the rule.  It is
more likely that the wrapping application would be responsible
for logging at the INFO and WARN/ERROR level.  Something like

try {
   LOG.info("Using Parquet to read file {}", path);
   avroParquetReader.read();
} catch (Exception e) {
  LOG.error("Failed to read Parquet file", e);
}

This is a very normal setup and doesn't require any additional logging from
the Parquet library itself.  Once I see an error with "Failed to read Parquet
file", then I'm going to turn on DEBUG logging and try to reproduce the
error.

Thanks,
David

On Fri, Jan 24, 2020 at 12:01 PM Ryan Blue 
wrote:

> I don't agree with the idea to convert all of Parquet's logs to DEBUG
> level, but I do think that we can improve the levels of individual
> messages.
>
> If we convert all logs to debug, then turning on logs to see what Parquet
> is doing would show everything from opening an input file to position
> tracking in output files. That's way too much information, which is why we
> use different log levels to begin with.
>
> I think we should continue using log levels to distinguish between types of
> information: error for errors, warn for recoverable errors that may or may
> not indicate a problem, info for regular operations, and debug for extra
> information if you're debugging the Parquet library. Following the common
> convention enables people to choose what information they want instead of
> mixing it all together.
>
> If you want to only see error and warning logs from Parquet, then the right
> way to do that is to configure your logger so that the level for
> org.apache.parquet classes is warn. That's not to say I don't agree that we
> can cut down on what is logged at info and clean it up; I just don't think
> it's a good idea to abandon the idea of log levels to distinguish between
> different information the user of a library will need.
>
> On Fri, Jan 24, 2020 at 6:30 AM lukas nalezenec  wrote:
>
> > Hi,
> > I can help too.
> > Lukas
> >
> > On Fri, Jan 24, 2020 at 15:29, David Mollitor 
> > wrote:
> >
> > > Hello Team,
> > >
> > > I am happy to do the work of reviewing all Parquet logging, but I need help
> > > getting the work committed.
> > >
> > > Fokko Driesprong has been a wonderful ally in helping me get incremental
> > > improvements into Parquet, but I wonder if there's anyone else that can
> > > share in the load.
> > >
> > > Thanks,
> > > David
> > >
> > > On Thu, Jan 23, 2020 at 11:55 AM Michael Heuer 
> > wrote:
> > >
> > > > Hello David,
> > > >
> > > > As I mentioned on PARQUET-1758, we have been frustrated by overly verbose
> > > > logging in Parquet for a long time.  Various workarounds have been more or
> > > > less successful, e.g.
> > > >
> > > > https://github.com/bigdatagenomics/adam/issues/851 <
> > > > https://github.com/bigdatagenomics/adam/issues/851>
> > > >
> > > > I would support a move making Parquet a silent partner.  :)
> > > >
> > > >michael
> > > >
> > > >
> > > > > On Jan 23, 2020, at 10:25 AM, David Mollitor 
> > > wrote:
> > > > >
> > > > > Hello Team,
> > > > >
> > > > > I have been a consumer of Apache Parquet through Apache Hive for several
> > > > > years now.  For a long time, logging in Parquet has been pretty painful.
> > > > > Some of the logging was going to STDOUT and some was going to Log4J.
> > > > > Overall, though the framework has been too verbose, spewing many log lines
> > > > > about internal details of Parquet I don't understand.
> > > > >
> > > > > The logging has gotten a lot better with recent releases moving solidly
> > > > > into SLF4J.  That is awesome and very welcomed.  However, (opinion alert) I
> > > > > think the logging is still too verbose.  I think Parquet should be a silent
> > > > > partner in data processing.  If everything is going well, it should be
> > > > > silent (DEBUG level logging).  If things are going wrong, it should throw
> > > > > an Exception.
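The per-package logger scoping discussed in this thread might look like the following, assuming a Log4j 1.x properties backend; this is an illustrative fragment, and any SLF4J binding has an equivalent:

```properties
# Illustrative Log4j 1.x configuration, not the project's actual setup.
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c - %m%n

# Quiet Parquet down to warnings and errors only:
log4j.logger.org.apache.parquet=WARN
```

Flipping the `org.apache.parquet` line to `DEBUG` is then the single switch an operator needs when reproducing a failure.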

Re: Writing to Local File

2020-01-24 Thread David Mollitor
Thanks Ryan for the confirmation of my suspicions.

That would certainly make a quick sample application easier to achieve from
an adoption perspective.

I had just put this JIRA in.  I'll leave it open for anyone to jump in on.
https://issues.apache.org/jira/browse/PARQUET-1776

Thanks,
David


On Fri, Jan 24, 2020 at 12:08 PM Ryan Blue 
wrote:

> There's not currently a way to do this without Hadoop. We've been working
> on moving to the `InputFile` and `OutputFile` abstractions so that we can
> get rid of it, but Parquet still depends on Hadoop libraries for
> compression and we haven't pulled out the parts of Parquet that use the new
> abstraction from the older ones that accept Hadoop Paths, so you need to
> have Hadoop in your classpath either way.
>
> To get to where you can write a file without Hadoop dependencies, I think
> we need to create a new module that parquet-hadoop will depend on with the
> `InputFile`/`OutputFile` implementation. Then we would refactor the Hadoop
> classes to extend those implementations to avoid breaking the Hadoop
> classes. We'd also need to implement the compression API directly on top of
> aircompressor in this module.
>
> On Thu, Jan 23, 2020 at 4:40 PM David Mollitor  wrote:
>
> > I am usually a user of Parquet through Hive or Spark, but I wanted to sit
> > down and write my own small example application of using the library
> > directly.
> >
> > Is there some quick way that I can write a Parquet file to the local file
> > system using java.nio.Path (i.e., with no Hadoop dependencies?)
> >
> > Thanks!
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
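The `OutputFile`-over-`java.nio.file.Path` idea from this thread can be sketched roughly as below. `PositionOutput` here is a hypothetical stand-in for Parquet's `PositionOutputStream`, used only so the sketch runs without parquet-mr on the classpath; the real interface shape is similar but not identical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for org.apache.parquet.io.PositionOutputStream:
// an OutputStream that can report how many bytes it has written.
abstract class PositionOutput extends OutputStream {
  public abstract long getPos();
}

// Sketch of an OutputFile-style wrapper over java.nio.file.Path.
public class NioOutputFile {
  private final Path path;

  public NioOutputFile(Path path) {
    this.path = path;
  }

  // Mirrors OutputFile#createOrOverwrite: open the file, track bytes written.
  public PositionOutput createOrOverwrite() throws IOException {
    final OutputStream out = Files.newOutputStream(
        path, StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    return new PositionOutput() {
      private long pos = 0;

      @Override
      public void write(int b) throws IOException {
        out.write(b);
        pos++;
      }

      @Override
      public void close() throws IOException {
        out.close();
      }

      @Override
      public long getPos() {
        return pos;
      }
    };
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("nio-output-file", ".bin");
    PositionOutput out = new NioOutputFile(tmp).createOrOverwrite();
    out.write(new byte[] {1, 2, 3});
    out.close();
    if (out.getPos() != 3 || Files.size(tmp) != 3) {
      throw new AssertionError("position tracking broken");
    }
    System.out.println("wrote " + out.getPos() + " bytes");
  }
}
```

A complete version would also cover `supportsBlockSize()`/`defaultBlockSize()` and a matching `InputFile` with a seekable `newStream()`, which is where most of the remaining Hadoop coupling lives.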


[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1776:

Description: Add a wrapper around Java NIO Path for 
{{org.apache.parquet.io.OutputFile}} and {{org.apache.parquet.io.InputFile}}  
(was: Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} 
and {{org.apache.parquet.io.InputFile}})

> Add Java NIO Avro OutputFile InputFile
> --
>
> Key: PARQUET-1776
> URL: https://issues.apache.org/jira/browse/PARQUET-1776
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>    Reporter: David Mollitor
>Priority: Minor
>
> Add a wrapper around Java NIO Path for {{org.apache.parquet.io.OutputFile}} 
> and {{org.apache.parquet.io.InputFile}}





[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1776:

Labels:   (was: avro)

> Add Java NIO Avro OutputFile InputFile
> --
>
> Key: PARQUET-1776
> URL: https://issues.apache.org/jira/browse/PARQUET-1776
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>    Reporter: David Mollitor
>Priority: Minor
>
> Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and 
> {{org.apache.parquet.io.InputFile}}





[jira] [Created] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile

2020-01-24 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1776:
---

 Summary: Add Java NIO Avro OutputFile InputFile
 Key: PARQUET-1776
 URL: https://issues.apache.org/jira/browse/PARQUET-1776
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor


Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and 
{{org.apache.parquet.io.InputFile}}





[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1776:

Labels: avro  (was: )

> Add Java NIO Avro OutputFile InputFile
> --
>
> Key: PARQUET-1776
> URL: https://issues.apache.org/jira/browse/PARQUET-1776
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>Priority: Minor
>  Labels: avro
>
> Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and 
> {{org.apache.parquet.io.InputFile}}





[jira] [Updated] (PARQUET-1776) Add Java NIO Avro OutputFile InputFile

2020-01-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1776:

Component/s: parquet-avro

> Add Java NIO Avro OutputFile InputFile
> --
>
> Key: PARQUET-1776
> URL: https://issues.apache.org/jira/browse/PARQUET-1776
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>    Reporter: David Mollitor
>Priority: Minor
>  Labels: avro
>
> Add a wrapper around Java NIO for {{org.apache.parquet.io.OutputFile}} and 
> {{org.apache.parquet.io.InputFile}}





[jira] [Created] (PARQUET-1775) Deprecate AvroParquetWriter Builder Hadoop Path

2020-01-24 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1775:
---

 Summary: Deprecate AvroParquetWriter Builder Hadoop Path
 Key: PARQUET-1775
 URL: https://issues.apache.org/jira/browse/PARQUET-1775
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


Trying to write a sample program with Parquet and came across the following 
quirk:

 

The {{AvroParquetWriter}} has no qualms about building one using 
{{org.apache.hadoop.fs.Path}}.  However, doing so in {{AvroParquetReader}} is 
deprecated.  I think it's appropriate to remove all dependencies of Hadoop from 
this simple reader/writer API.

 

To make it consistent, also deprecate the use of {{org.apache.hadoop.fs.Path}} 
in the {{AvroParquetWriter}}.

 

[https://github.com/apache/parquet-mr/blob/8c1bc9bcdeeac8178fecf61d18dc56913907fd46/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L38]

 

https://github.com/apache/parquet-mr/blob/8c1bc9bcdeeac8178fecf61d18dc56913907fd46/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetReader.java#L47





Re: Parquet Verbose Logging

2020-01-24 Thread David Mollitor
Hello Team,

I am happy to do the work of reviewing all Parquet logging, but I need help
getting the work committed.

Fokko Driesprong has been a wonderful ally in helping me get incremental
improvements into Parquet, but I wonder if there's anyone else that can
share in the load.

Thanks,
David

On Thu, Jan 23, 2020 at 11:55 AM Michael Heuer  wrote:

> Hello David,
>
> As I mentioned on PARQUET-1758, we have been frustrated by overly verbose
> logging in Parquet for a long time.  Various workarounds have been more or
> less successful, e.g.
>
> https://github.com/bigdatagenomics/adam/issues/851 <
> https://github.com/bigdatagenomics/adam/issues/851>
>
> I would support a move making Parquet a silent partner.  :)
>
>michael
>
>
> > On Jan 23, 2020, at 10:25 AM, David Mollitor  wrote:
> >
> > Hello Team,
> >
> > I have been a consumer of Apache Parquet through Apache Hive for several
> > years now.  For a long time, logging in Parquet has been pretty painful.
> > Some of the logging was going to STDOUT and some was going to Log4J.
> > Overall, though the framework has been too verbose, spewing many log lines
> > about internal details of Parquet I don't understand.
> >
> > The logging has gotten a lot better with recent releases moving solidly
> > into SLF4J.  That is awesome and very welcomed.  However, (opinion alert) I
> > think the logging is still too verbose.  I think Parquet should be a silent
> > partner in data processing.  If everything is going well, it should be
> > silent (DEBUG level logging).  If things are going wrong, it should throw
> > an Exception.
> >
> > If an operator suspects Parquet is the issue (and that's rarely the first
> > thing to check), they can set the logging for all of the Loggers in the
> > entire Parquet package (org.apache.parquet) to DEBUG to get the required
> > information.  Not to mention, the less logging it does, the faster it will
> > be.
> >
> > I've opened this discussion because I've got two PRs related to this topic
> > ready to go:
> >
> > PARQUET-1758
> > PARQUET-1761
> >
> > Thanks,
> > David
>
>


Writing to Local File

2020-01-23 Thread David Mollitor
I am usually a user of Parquet through Hive or Spark, but I wanted to sit
down and write my own small example application of using the library
directly.

Is there some quick way that I can write a Parquet file to the local file
system using java.nio.Path (i.e., with no Hadoop dependencies?)

Thanks!


[jira] [Updated] (PARQUET-1758) InternalParquetRecordReader Logging is Too Verbose

2020-01-23 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1758:

Summary: InternalParquetRecordReader Logging is Too Verbose  (was: 
InternalParquetRecordReader Logging it Too Verbose)

> InternalParquetRecordReader Logging is Too Verbose
> --
>
> Key: PARQUET-1758
> URL: https://issues.apache.org/jira/browse/PARQUET-1758
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>
> A low-level library like Parquet should be pretty quiet.  It should just do 
> its work and keep quiet.  Most issues should be addressed by throwing 
> Exceptions, and the occasional warning message; otherwise it will clutter the 
> logging for the top-level application.  If debugging is required, an 
> administrator can enable it for the specific workload.
> *Warning:* This is my opinion. No stats to back it up.





Parquet Verbose Logging

2020-01-23 Thread David Mollitor
Hello Team,

I have been a consumer of Apache Parquet through Apache Hive for several
years now.  For a long time, logging in Parquet has been pretty painful.
Some of the logging was going to STDOUT and some was going to Log4J.
Overall, though the framework has been too verbose, spewing many log lines
about internal details of Parquet I don't understand.

The logging has gotten a lot better with recent releases moving solidly
into SLF4J.  That is awesome and very welcomed.  However, (opinion alert) I
think the logging is still too verbose.  I think Parquet should be a silent
partner in data processing.  If everything is going well, it should be
silent (DEBUG level logging).  If things are going wrong, it should throw
an Exception.

If an operator suspects Parquet is the issue (and that's rarely the first
thing to check), they can set the logging for all of the Loggers in the
entire Parquet package (org.apache.parquet) to DEBUG to get the required
information.  Not to mention, the less logging it does, the faster it will
be.

I've opened this discussion because I've got two PRs related to this topic
ready to go:

PARQUET-1758
PARQUET-1761

Thanks,
David


Re: Spotless

2020-01-22 Thread David Mollitor
I think you want this in place before bloom filters are released.  Since
it's the newest code, it is most at risk of receiving fixes and
improvements.  You're not going to want to use spotless after the feature
is introduced and make backports more difficult.

On Wed, Jan 22, 2020, 10:02 AM Driesprong, Fokko 
wrote:

> I've rebased the PR: https://github.com/apache/parquet-mr/pull/730
>
> I did some searching and as far as I can tell, spotless does not allow to
> only apply it to VCS changed lines. If the forked repo also applies
> spotless, then it would be possible to do a diff.
>
> For me, I'm still interested in applying this, so we can keep our code
> clean and consistent. For example, I would like to enforce the use of
> braces, as it makes the code much more readable in my opinion.
>
> Cheers, Fokko
>
>
>
> On Thu, Jan 9, 2020 at 10:23, Gabor Szadovszky wrote:
>
> > Personally, I don't like formatting the whole code during minor version
> > development. These changes make it really hard to cherry-pick changes to
> > forked repos. It also makes it hard to blame the code.
> > It is great to have a common code style and formatting configuration, but I
> > would only apply it to new lines. Let's make changes that impact
> > the whole code base at the beginning of a new major version development
> > where compatibility will break anyway.
> >
> > I am hesitating to give a -1, though. If everyone agrees this is a good
> > idea, I'm fine with that. So, let me give a -0.
> >
> > Cheers,
> > Gabor
> >
> > On Wed, Jan 8, 2020 at 7:36 PM Ryan Blue 
> > wrote:
> >
> > > +1 for spotless checks.
> > >
> > > On Wed, Jan 8, 2020 at 7:13 AM Driesprong, Fokko  >
> > > wrote:
> > >
> > > > Y'all,
> > > >
> > > > Recently Chen Junjie brought up the removal of trailing spaces within
> > the
> > > > code and the headers:
> > > > https://github.com/apache/parquet-mr/pull/727#issuecomment-571562392
> > > >
> > > > I've been looking into this and looked into if we can apply something
> > > like
> > > > checkstyle to let it fail on trailing whitespace. However, it comes up
> > > > with a LOT of warnings on improper formatting, short variables, wrong
> > > > import orders, etc.
> > > > For Apache Avro we've added Spotless as a maven plugin:
> > > > https://github.com/diffplug/spotless. Unlike checkstyle, spotless
> will
> > > > also
> > > > fix the formatting. Would this be something that others find useful?
> > > > The main problem is that we need to apply this to the codebase, and
> > this
> > > > will break a lot of PR's, and it will mess up a bit of the version
> > > control,
> > > > because a lot of lines will be changed:
> > > > https://github.com/apache/parquet-mr/pull/730/
> > > >
> > > > WDYT?
> > > >
> > > > Cheers, Fokko
> > > >
> > >
> > >
> > > --
> > > Ryan Blue
> > > Software Engineer
> > > Netflix
> > >
> >
>
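The Spotless Maven plugin discussed above is typically wired into a pom.xml roughly as follows; the version and the enabled steps here are illustrative assumptions, not the project's actual configuration:

```xml
<plugin>
  <groupId>com.diffplug.spotless</groupId>
  <artifactId>spotless-maven-plugin</artifactId>
  <version>2.0.1</version> <!-- illustrative version -->
  <configuration>
    <java>
      <!-- The kinds of cleanups mentioned in the thread: -->
      <removeUnusedImports/>
      <trimTrailingWhitespace/>
      <endWithNewline/>
    </java>
  </configuration>
</plugin>
```

Running `mvn spotless:check` then fails the build on violations, and `mvn spotless:apply` rewrites the files in place, which is exactly the whole-codebase diff concern Gabor raises.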


[jira] [Commented] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose

2020-01-13 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014416#comment-17014416
 ] 

David Mollitor commented on PARQUET-1758:
-

I think the general idea is that almost all logging is DEBUG level for such a 
library.  It may be advantageous to set up YETUS so that the automated builds 
run with DEBUG logging enabled, but my feeling is that most logging shouldn't be 
enabled by default.

> InternalParquetRecordReader Logging it Too Verbose
> --
>
> Key: PARQUET-1758
> URL: https://issues.apache.org/jira/browse/PARQUET-1758
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>
> A low-level library like Parquet should be pretty quiet.  It should just do 
> its work and keep quiet.  Most issues should be addressed by throwing 
> Exceptions, and the occasional warning message; otherwise it will clutter the 
> logging for the top-level application.  If debugging is required, an 
> administrator can enable it for the specific workload.
> *Warning:* This is my opinion. No stats to back it up.





[jira] [Comment Edited] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose

2020-01-12 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014010#comment-17014010
 ] 

David Mollitor edited comment on PARQUET-1758 at 1/13/20 4:21 AM:
--

I am certainly open for discussions.  I too have had some logging pain 
emanating from Parquet with the Apache Hive project.

Debug logging would only help performance since less time would be spent 
logging.


was (Author: belugabehr):
Debug logging would only help performance since less time would be spent 
logging.

> InternalParquetRecordReader Logging it Too Verbose
> --
>
> Key: PARQUET-1758
> URL: https://issues.apache.org/jira/browse/PARQUET-1758
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>
> A low-level library like Parquet should be pretty quiet.  It should just do 
> its work and keep quiet.  Most issues should be addressed by throwing 
> Exceptions, and the occasional warning message; otherwise it will clutter the 
> logging for the top-level application.  If debugging is required, an 
> administrator can enable it for the specific workload.
> *Warning:* This is my opinion. No stats to back it up.





[jira] [Commented] (PARQUET-1758) InternalParquetRecordReader Logging it Too Verbose

2020-01-12 Thread David Mollitor (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014010#comment-17014010
 ] 

David Mollitor commented on PARQUET-1758:
-

Debug logging would only help performance since less time would be spent 
logging.

> InternalParquetRecordReader Logging it Too Verbose
> --
>
> Key: PARQUET-1758
> URL: https://issues.apache.org/jira/browse/PARQUET-1758
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>
> A low-level library like Parquet should be pretty quiet.  It should just do 
> its work and keep quiet.  Most issues should be addressed by throwing 
> Exceptions, with the occasional warning message; otherwise the library will 
> clutter the logging of the top-level application.  If debugging is required, 
> an administrator can enable it for the specific workload.
> *Warning:* This is my opinion. No stats to back it up.





[jira] [Created] (PARQUET-1763) Add SLF4J to TestCircularReferences

2020-01-12 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1763:
---

 Summary: Add SLF4J to TestCircularReferences
 Key: PARQUET-1763
 URL: https://issues.apache.org/jira/browse/PARQUET-1763
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


Currently prints to STDOUT.





[jira] [Created] (PARQUET-1762) Move BitPackingPerfTest to parquet-benchmarks Module

2020-01-12 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1762:
---

 Summary: Move BitPackingPerfTest to parquet-benchmarks Module
 Key: PARQUET-1762
 URL: https://issues.apache.org/jira/browse/PARQUET-1762
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor








[jira] [Created] (PARQUET-1761) Lower Logging Level in ParquetOutputFormat

2020-01-12 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1761:
---

 Summary: Lower Logging Level in ParquetOutputFormat
 Key: PARQUET-1761
 URL: https://issues.apache.org/jira/browse/PARQUET-1761
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1760) Use SLF4J Logger for TestStatistics

2020-01-12 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1760:
---

 Summary: Use SLF4J Logger for TestStatistics
 Key: PARQUET-1760
 URL: https://issues.apache.org/jira/browse/PARQUET-1760
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


It is dumping a lot of logging into STDOUT and STDERR.





[jira] [Created] (PARQUET-1759) InternalParquetRecordReader Use Singleton Set

2020-01-12 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1759:
---

 Summary: InternalParquetRecordReader Use Singleton Set
 Key: PARQUET-1759
 URL: https://issues.apache.org/jira/browse/PARQUET-1759
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


https://github.com/apache/parquet-mr/blob/d85a8f5dcfc1381655fcccaa81a2e83ba812f6a4/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L260-L262

Code currently instantiates a {{HashSet}} (with a default internal data 
structure of size 16) and then makes it immutable.  Use {{Collections#singleton}} 
to achieve the same goal with fewer lines of code and a smaller memory footprint.
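As a minimal sketch of the before/after (illustrative code, not the actual 
InternalParquetRecordReader source):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SingletonSetDemo {

    // Before: a full HashSet (backed by a 16-bucket hash table) is built
    // just to hold one element, then wrapped to make it immutable.
    static Set<String> withHashSet(String value) {
        Set<String> set = new HashSet<>();
        set.add(value);
        return Collections.unmodifiableSet(set);
    }

    // After: Collections.singleton returns an immutable one-element set
    // with no backing hash table at all.
    static Set<String> withSingleton(String value) {
        return Collections.singleton(value);
    }
}
```

Both methods return an immutable set with identical contents; only the 
allocation footprint differs.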





[jira] [Created] (PARQUET-1758) InternalParquetRecordReader Logging Is Too Verbose

2020-01-12 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1758:
---

 Summary: InternalParquetRecordReader Logging Is Too Verbose
 Key: PARQUET-1758
 URL: https://issues.apache.org/jira/browse/PARQUET-1758
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


A low-level library like Parquet should be pretty quiet.  It should just do its 
work and keep quiet.  Most issues should be addressed by throwing Exceptions, 
with the occasional warning message; otherwise the library will clutter the 
logging of the top-level application.  If debugging is required, an 
administrator can enable it for the specific workload.

*Warning:* This is my opinion. No stats to back it up.





[jira] [Updated] (PARQUET-1756) Remove Dependency on Maven Plugin semantic-versioning

2020-01-10 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1756:

Summary: Remove Dependency on Maven Plugin semantic-versioning  (was: 
Remove References to Maven Plugin semantic-versioning)

> Remove Dependency on Maven Plugin semantic-versioning
> -
>
> Key: PARQUET-1756
> URL: https://issues.apache.org/jira/browse/PARQUET-1756
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> https://github.com/jeluard/semantic-versioning
> According to their github page:
> {quote}
> This library is in dormant state and won't add any new feature. 
> {quote}
> Also, looking at their README file, it looks like the Parquet library is 
> including their library in the Maven build process, but is not actually 
> calling it.





[jira] [Created] (PARQUET-1757) Upgrade Apache POM Parent Version to 22

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1757:
---

 Summary: Upgrade Apache POM Parent Version to 22
 Key: PARQUET-1757
 URL: https://issues.apache.org/jira/browse/PARQUET-1757
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1756) Remove References to Maven Plugin semantic-versioning

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1756:
---

 Summary: Remove References to Maven Plugin semantic-versioning
 Key: PARQUET-1756
 URL: https://issues.apache.org/jira/browse/PARQUET-1756
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


https://github.com/jeluard/semantic-versioning

According to their github page:

{quote}
This library is in dormant state and won't add any new feature. 
{quote}

Also, looking at their README file, it looks like the Parquet library is 
including their library in the Maven build process, but is not actually calling 
it.





[jira] [Updated] (PARQUET-1755) Remove slf4j-simple From parquet-benchmarks Module

2020-01-10 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1755:

Summary: Remove slf4j-simple From parquet-benchmarks Module  (was: Module 
parquet-benchmarks Ships With slf4j-simple)

> Remove slf4j-simple From parquet-benchmarks Module
> --
>
> Key: PARQUET-1755
> URL: https://issues.apache.org/jira/browse/PARQUET-1755
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> The {{parquet-benchmarks}} module ships with the Log4J logger and the SLF4J 
> "simple" logger.  Since this is a stand-alone application and needs Log4J, 
> there is no reason to also use the "simple" logger.
> {code:none}
> ### parquet-benchmarks
> [INFO] Including org.slf4j:slf4j-simple:jar:1.7.22 in the shaded jar.
> [INFO] Including org.slf4j:slf4j-api:jar:1.7.22 in the shaded jar.
> [INFO] Including log4j:log4j:jar:1.2.17 in the shaded jar.
> {code}





[jira] [Created] (PARQUET-1755) Module parquet-benchmarks Ships With slf4j-simple

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1755:
---

 Summary: Module parquet-benchmarks Ships With slf4j-simple
 Key: PARQUET-1755
 URL: https://issues.apache.org/jira/browse/PARQUET-1755
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


The {{parquet-benchmarks}} module ships with the Log4J logger and the SLF4J 
"simple" logger.  Since this is a stand-alone application and needs Log4J, 
there is no reason to also use the "simple" logger.

{code:none}
### parquet-benchmarks

[INFO] Including org.slf4j:slf4j-simple:jar:1.7.22 in the shaded jar.
[INFO] Including org.slf4j:slf4j-api:jar:1.7.22 in the shaded jar.
[INFO] Including log4j:log4j:jar:1.2.17 in the shaded jar.
{code}





[jira] [Created] (PARQUET-1754) Include SLF4J Logger For parquet-format-structures Tests

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1754:
---

 Summary: Include SLF4J Logger For parquet-format-structures Tests
 Key: PARQUET-1754
 URL: https://issues.apache.org/jira/browse/PARQUET-1754
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


{code:none}
### /home/apache/parquet/parquet-mr/parquet-format-structures

---
 T E S T S
---
Running org.apache.parquet.format.TestUtil
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
{code}





[jira] [Updated] (PARQUET-1753) Ensure Parquet Version slf4j Libraries Are Included In parquet-thrift Module

2020-01-10 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1753:

Summary: Ensure Parquet Version slf4j Libraries Are Included In 
parquet-thrift Module  (was: Ensure Parquet Version slf4j Libraries Are 
Included)

> Ensure Parquet Version slf4j Libraries Are Included In parquet-thrift Module
> 
>
> Key: PARQUET-1753
> URL: https://issues.apache.org/jira/browse/PARQUET-1753
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> {code:none}
> ### parquet-thrift
> [INFO] Excluding com.google.code.findbugs:jsr305:jar:3.0.0 from the 
> shaded jar.
> [INFO] Excluding com.twitter.elephantbird:elephant-bird-core:jar:4.4 
> from the shaded jar.
> [INFO] Excluding 
> com.twitter.elephantbird:elephant-bird-hadoop-compat:jar:4.4 from the shaded 
> jar.
> ***[INFO] Excluding org.slf4j:slf4j-api:jar:1.6.4 from the shaded jar.***
> [INFO] Excluding commons-lang:commons-lang:jar:2.4 from the shaded jar.
> [INFO] Excluding com.google.guava:guava:jar:11.0.1 from the shaded jar.
> {code}
> You can see that slf4j-api is version *1.6.4*.  All other parquet modules are 
> using *1.7.x*.
> 1.6.4 is being brought in by some old dependencies (primarily 
> {{com.twitter.elephantbird}}).





[jira] [Created] (PARQUET-1753) Ensure Parquet Version slf4j Libraries Are Included

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1753:
---

 Summary: Ensure Parquet Version slf4j Libraries Are Included
 Key: PARQUET-1753
 URL: https://issues.apache.org/jira/browse/PARQUET-1753
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


{code:none}
### parquet-thrift
[INFO] Excluding com.google.code.findbugs:jsr305:jar:3.0.0 from the shaded 
jar.
[INFO] Excluding com.twitter.elephantbird:elephant-bird-core:jar:4.4 from 
the shaded jar.
[INFO] Excluding 
com.twitter.elephantbird:elephant-bird-hadoop-compat:jar:4.4 from the shaded 
jar.
***[INFO] Excluding org.slf4j:slf4j-api:jar:1.6.4 from the shaded jar.***
[INFO] Excluding commons-lang:commons-lang:jar:2.4 from the shaded jar.
[INFO] Excluding com.google.guava:guava:jar:11.0.1 from the shaded jar.
{code}

You can see that slf4j-api is version *1.6.4*.  All other parquet modules are 
using *1.7.x*.

1.6.4 is being brought in by some old dependencies (primarily 
{{com.twitter.elephantbird}}).





[jira] [Created] (PARQUET-1752) Remove slf4j-log4j12 Binding from parquet-protobuf Module

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1752:
---

 Summary: Remove slf4j-log4j12 Binding from parquet-protobuf Module
 Key: PARQUET-1752
 URL: https://issues.apache.org/jira/browse/PARQUET-1752
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


{code:none}
Running org.apache.parquet.proto.ProtoInputOutputFormatTest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
{code}

There are two bindings being included, which produces this warning.  There is 
also a log4j properties file in the test resources, but all it does is produce 
logging to the console.  Just stick with the {{slf4j-simple}} logger for 
testing.





[jira] [Updated] (PARQUET-1752) Remove slf4j-log4j12 Binding from parquet-protobuf Module

2020-01-10 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1752:

Description: 
{code:none}
Running org.apache.parquet.proto.ProtoInputOutputFormatTest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
{code}

There are two bindings being included, which produces this warning.  
{{slf4j-log4j12}} is coming in as a transitive dependency.  There is also a 
log4j properties file in the test resources, but all it does is produce logging 
to the console.  Just stick with the {{slf4j-simple}} logger for testing (which 
is already explicitly specified for testing).

  was:
{code:none}
Running org.apache.parquet.proto.ProtoInputOutputFormatTest
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
{code}

There are two bindings being included, which produces this warning.  There is 
also a log4j properties file in the test resources, but all it does is produce 
logging to the console.  Just stick with the {{slf4j-simple}} logger for 
testing.


> Remove slf4j-log4j12 Binding from parquet-protobuf Module
> -
>
> Key: PARQUET-1752
> URL: https://issues.apache.org/jira/browse/PARQUET-1752
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> {code:none}
> Running org.apache.parquet.proto.ProtoInputOutputFormatTest
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/m2/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/m2/org/slf4j/slf4j-simple/1.7.22/slf4j-simple-1.7.22.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> {code}
> There are two bindings being included, which produces this warning.  
> {{slf4j-log4j12}} is coming in as a transitive dependency.  There is also a 
> log4j properties file in the test resources, but all it does is produce 
> logging to the console.  Just stick with the {{slf4j-simple}} logger for 
> testing (which is already explicitly specified for testing).





[jira] [Created] (PARQUET-1751) Fix Protobuf Build Warning

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1751:
---

 Summary: Fix Protobuf Build Warning
 Key: PARQUET-1751
 URL: https://issues.apache.org/jira/browse/PARQUET-1751
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor


{code:none}
[libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax 
specified for the proto file: TestProtobuf.proto. Please use 'syntax = 
"proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to 
proto2 syntax.)
{code}





[jira] [Assigned] (PARQUET-1751) Fix Protobuf Build Warning

2020-01-10 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor reassigned PARQUET-1751:
---

Assignee: David Mollitor

> Fix Protobuf Build Warning
> --
>
> Key: PARQUET-1751
> URL: https://issues.apache.org/jira/browse/PARQUET-1751
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Trivial
>
> {code:none}
> [libprotobuf WARNING google/protobuf/compiler/parser.cc:546] No syntax 
> specified for the proto file: TestProtobuf.proto. Please use 'syntax = 
> "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to 
> proto2 syntax.)
> {code}





[jira] [Created] (PARQUET-1750) Reduce Memory Usage of RowRanges Class

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1750:
---

 Summary: Reduce Memory Usage of RowRanges Class
 Key: PARQUET-1750
 URL: https://issues.apache.org/jira/browse/PARQUET-1750
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


{{RowRanges}} maintains an internal {{ArrayList}} with a default capacity (10). 
 However, sometimes it is known ahead of time that only a single instance of 
{{Range}} will be added.  For these cases, do not instantiate an {{ArrayList}}.
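A sketch of the idea, using a hypothetical {{Range}} stand-in rather than the 
real Parquet class:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SingleRangeDemo {

    // Hypothetical stand-in for the real org.apache.parquet Range class.
    static final class Range {
        final long from;
        final long to;
        Range(long from, long to) { this.from = from; this.to = to; }
    }

    // Before: an ArrayList whose backing array grows to capacity 10
    // even though only one Range is ever added.
    static List<Range> withArrayList(Range r) {
        List<Range> ranges = new ArrayList<>();
        ranges.add(r);
        return ranges;
    }

    // After: an immutable one-element list with no growable backing array.
    static List<Range> withSingletonList(Range r) {
        return Collections.singletonList(r);
    }
}
```

The singleton list also documents the invariant: callers cannot accidentally 
add a second range later.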





[jira] [Created] (PARQUET-1749) Use Java 8 Streams for Empty PrimitiveIterator

2020-01-10 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1749:
---

 Summary: Use Java 8 Streams for Empty PrimitiveIterator
 Key: PARQUET-1749
 URL: https://issues.apache.org/jira/browse/PARQUET-1749
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1737) Replace Test Class RandomStr with Apache Commons Lang

2020-01-06 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1737:
---

 Summary: Replace Test Class RandomStr with Apache Commons Lang
 Key: PARQUET-1737
 URL: https://issues.apache.org/jira/browse/PARQUET-1737
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1736) Use StringBuilder instead of StringBuffer

2020-01-06 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1736:
---

 Summary: Use StringBuilder instead of StringBuffer
 Key: PARQUET-1736
 URL: https://issues.apache.org/jira/browse/PARQUET-1736
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


StringBuffer is synchronized and therefore incurs locking overhead even when it 
is not shared between threads.  Use the unsynchronized StringBuilder instead.
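The change is a drop-in replacement, as this illustrative sketch (not a 
specific Parquet method) shows:

```java
public class BuilderDemo {

    // StringBuffer: every append acquires the object's monitor,
    // even in single-threaded code.
    static String joinWithBuffer(String[] parts) {
        StringBuffer sb = new StringBuffer();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    // StringBuilder: identical API, no synchronization overhead.
    static String joinWithBuilder(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }
}
```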





[jira] [Created] (PARQUET-1735) Clean Up parquet-columns Module

2020-01-05 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1735:
---

 Summary: Clean Up parquet-columns Module
 Key: PARQUET-1735
 URL: https://issues.apache.org/jira/browse/PARQUET-1735
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


{code:none}
Remove unused imports
Remove unused local variables
Add missing '@Override' annotations
Add missing '@Override' annotations to implementations of interface methods
Add missing '@Deprecated' annotations
Remove unnecessary casts
Remove redundant semicolons
Remove unnecessary '$NON-NLS$' tags
Remove redundant type arguments
{code}





[jira] [Updated] (PARQUET-1732) Call toArray With Empty Array

2019-12-27 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1732:

Description: 
[https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java]

 
{quote}It is recommended now to use list.toArray(new Foo[0]);, not 
list.toArray(new Foo[list.size()]);.
{quote}
... less code too :)

  was:
[https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java]

 

{quote}

It is recommended now to use list.toArray(new Foo[0]);, not list.toArray(new 
Foo[list.size()]);.

{quote}


> Call toArray With Empty Array
> -
>
> Key: PARQUET-1732
> URL: https://issues.apache.org/jira/browse/PARQUET-1732
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> [https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java]
>  
> {quote}It is recommended now to use list.toArray(new Foo[0]);, not 
> list.toArray(new Foo[list.size()]);.
> {quote}
> ... less code too :)





[jira] [Created] (PARQUET-1732) Call toArray With Empty Array

2019-12-27 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1732:
---

 Summary: Call toArray With Empty Array
 Key: PARQUET-1732
 URL: https://issues.apache.org/jira/browse/PARQUET-1732
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


[https://stackoverflow.com/questions/9572795/convert-list-to-array-in-java]

 

{quote}

It is recommended now to use list.toArray(new Foo[0]);, not list.toArray(new 
Foo[list.size()]);.

{quote}
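The two forms side by side (illustrative sketch only):

```java
import java.util.Arrays;
import java.util.List;

public class ToArrayDemo {

    // Recommended: pass a zero-length array as a type token; the JVM
    // allocates the correctly sized result array internally.
    static String[] emptyArrayForm(List<String> list) {
        return list.toArray(new String[0]);
    }

    // Older advice: pre-size the array yourself; more code and, on
    // modern JVMs, typically no faster.
    static String[] presizedForm(List<String> list) {
        return list.toArray(new String[list.size()]);
    }
}
```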





[jira] [Created] (PARQUET-1731) Use JDK 8 Facilities to Simplify FilteringRecordMaterializer

2019-12-27 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1731:
---

 Summary: Use JDK 8 Facilities to Simplify 
FilteringRecordMaterializer
 Key: PARQUET-1731
 URL: https://issues.apache.org/jira/browse/PARQUET-1731
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1729) Avoid AutoBoxing in EncodingStats

2019-12-24 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1729:
---

 Summary: Avoid AutoBoxing in EncodingStats
 Key: PARQUET-1729
 URL: https://issues.apache.org/jira/browse/PARQUET-1729
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


Use AtomicInteger as the counter value instead of the immutable Integer type, 
which must be un-boxed and re-boxed on every increment.

 

[https://www.programcreek.com/2013/10/efficient-counter-in-java/]
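A generic counter sketch contrasting the two approaches (illustrative, not the 
EncodingStats source):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CounterDemo {

    // Boxed counter: each increment un-boxes the Integer, adds one,
    // and boxes the result back into a new Integer.
    static Map<String, Integer> boxedCount(String[] items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            Integer n = counts.get(item);
            counts.put(item, n == null ? 1 : n + 1);
        }
        return counts;
    }

    // Mutable counter: the AtomicInteger is allocated once per key and
    // then incremented in place, with no further boxing or map puts.
    static Map<String, AtomicInteger> mutableCount(String[] items) {
        Map<String, AtomicInteger> counts = new HashMap<>();
        for (String item : items) {
            counts.computeIfAbsent(item, k -> new AtomicInteger()).incrementAndGet();
        }
        return counts;
    }
}
```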





[jira] [Updated] (PARQUET-1728) Simplify NullPointerException Handling in AvroWriteSupport

2019-12-24 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1728:

Summary: Simplify NullPointerException Handling in AvroWriteSupport  (was: 
Simplify Handle NullPointerException Handling in AvroWriteSupport)

> Simplify NullPointerException Handling in AvroWriteSupport
> --
>
> Key: PARQUET-1728
> URL: https://issues.apache.org/jira/browse/PARQUET-1728
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>
> * Use Java Collection API to simplify
>  * Remove new-line character from logging to play nice with 'grep'



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1728) Simplify Handle NullPointerException Handling in AvroWriteSupport

2019-12-24 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1728:
---

 Summary: Simplify Handle NullPointerException Handling in 
AvroWriteSupport
 Key: PARQUET-1728
 URL: https://issues.apache.org/jira/browse/PARQUET-1728
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


* Use Java Collection API to simplify
 * Remove new-line character from logging to play nice with 'grep'





[jira] [Created] (PARQUET-1727) Do Not Swallow InterruptedException in ParquetLoader

2019-12-24 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1727:
---

 Summary: Do Not Swallow InterruptedException in ParquetLoader
 Key: PARQUET-1727
 URL: https://issues.apache.org/jira/browse/PARQUET-1727
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Updated] (PARQUET-1726) Use Java 8 Multi Exception Handling

2019-12-23 Thread David Mollitor (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Mollitor updated PARQUET-1726:

Description: Simplify the code and remove lines of code

> Use Java 8 Multi Exception Handling
> ---
>
> Key: PARQUET-1726
> URL: https://issues.apache.org/jira/browse/PARQUET-1726
> Project: Parquet
>  Issue Type: Improvement
>    Reporter: David Mollitor
>    Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>
> Simplify the code and remove lines of code
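A generic illustration of the JDK 7+ multi-catch form (not taken from the 
Parquet code base):

```java
public class MultiCatchDemo {

    static int parseOrDefault(String s, int fallback) {
        try {
            return Integer.parseInt(s.trim());
        } catch (NumberFormatException | NullPointerException e) {
            // One handler replaces two identical catch blocks.
            return fallback;
        }
    }
}
```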





[jira] [Created] (PARQUET-1726) Use Java 8 Multi Exception Handling

2019-12-23 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1726:
---

 Summary: Use Java 8 Multi Exception Handling
 Key: PARQUET-1726
 URL: https://issues.apache.org/jira/browse/PARQUET-1726
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1725) Replace Usage of Strings.join with JDK Functionality in ColumnPath Class

2019-12-23 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1725:
---

 Summary: Replace Usage of Strings.join with JDK Functionality in 
ColumnPath Class
 Key: PARQUET-1725
 URL: https://issues.apache.org/jira/browse/PARQUET-1725
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor








[jira] [Created] (PARQUET-1724) Use ConcurrentHashMap for Cache in DictionaryPageReader

2019-12-23 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1724:
---

 Summary: Use ConcurrentHashMap for Cache in DictionaryPageReader
 Key: PARQUET-1724
 URL: https://issues.apache.org/jira/browse/PARQUET-1724
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


* Use ConcurrentHashMap for Cache in DictionaryPageReader
 * Use Java 1.8 APIs
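The shape of the change can be sketched as follows; the class and loader names 
are hypothetical, standing in for the real dictionary-page read:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class PageCacheDemo {

    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;

    PageCacheDemo(Function<String, String> loader) {
        this.loader = loader;
    }

    // computeIfAbsent gives atomic check-then-load semantics without
    // external synchronization around the cache.
    String lookup(String column) {
        return cache.computeIfAbsent(column, loader);
    }
}
```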





[jira] [Created] (PARQUET-1723) Read From Maps Without Using Contains

2019-12-23 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1723:
---

 Summary: Read From Maps Without Using Contains
 Key: PARQUET-1723
 URL: https://issues.apache.org/jira/browse/PARQUET-1723
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


I see a few places with the following pattern...

 

{code:java}
if (map.containsKey(key)) {
   return map.get(key);
}
{code}

Better to just call {{get()}} and then check the return value for 'null' to 
determine whether the key is present.  This avoids traversing the {{Map}} 
twice: once for {{containsKey}} and once for {{get}}.
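A sketch of the rewrite; note that the single-lookup form is only equivalent 
when the map is known not to store null values:

```java
import java.util.HashMap;
import java.util.Map;

public class MapLookupDemo {

    // Two traversals: containsKey walks the map, then get walks it again.
    static String doubleLookup(Map<String, String> map, String key) {
        if (map.containsKey(key)) {
            return map.get(key);
        }
        return null;
    }

    // One traversal: get already answers "absent" with null.
    static String singleLookup(Map<String, String> map, String key) {
        return map.get(key);
    }
}
```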



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1710) Use Objects.requireNonNull

2019-12-03 Thread David Mollitor (Jira)
David Mollitor created PARQUET-1710:
---

 Summary: Use Objects.requireNonNull
 Key: PARQUET-1710
 URL: https://issues.apache.org/jira/browse/PARQUET-1710
 Project: Parquet
  Issue Type: Improvement
Reporter: David Mollitor
Assignee: David Mollitor


https://docs.oracle.com/javase/8/docs/api/java/util/Objects.html#requireNonNull-T-java.lang.String-
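A minimal sketch of the pattern this replaces (illustrative class, not a 
Parquet type):

```java
import java.util.Objects;

public class RequireNonNullDemo {

    private final String name;

    // Replaces a hand-rolled "if (name == null) throw new
    // NullPointerException(...)" block with a single JDK call.
    RequireNonNullDemo(String name) {
        this.name = Objects.requireNonNull(name, "name must not be null");
    }

    String name() {
        return name;
    }
}
```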





Re: Parquet vs. other Open Source Columnar Formats

2019-05-09 Thread David Mollitor
I'm sure there are many different opinions on the matter, but in regards to
Avro, I would say it is becoming more and more of a niche player.

Many folks are choosing to go with Google Protobufs for RPC and Parquet/ORC
for analytic workloads.

On Thu, May 9, 2019 at 2:30 PM Brian Bowman  wrote:

> All,
>
> Is it fair to say that Parquet is fast becoming the dominate open source
> columnar storage format?   How do those of you with long-term Hadoop
> experience see this?  For example, is Parquet overtaking ORC and Avro?
>
> Thanks,
>
> Brian
>