Re: Supporting attribute in Parquet schema

2016-06-30 Thread Julien Le Dem
You can store arbitrary key values alongside the schema in the footer:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L565
 

struct FileMetaData {
  /** Version of this file **/
  1: required i32 version

  /** Parquet schema for this file.  This schema contains metadata for all the columns.
   * The schema is represented as a tree with a single root.  The nodes of the tree
   * are flattened to a list by doing a depth-first traversal.
   * The column metadata contains the path in the schema for that column which can be
   * used to map columns to nodes in the schema.
   * The first element is the root **/
  2: required list<SchemaElement> schema;

  /** Number of rows in this file **/
  3: required i64 num_rows

  /** Row groups in this file **/
  4: required list<RowGroup> row_groups

  /** Optional key/value metadata **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file.  This should be in the format
   *  <Application> version <App Version> (build <App Build Hash>).
   * e.g. impala version 1.0 (build 6cf94d29b2b7115df4de2c06e2ab4326d721eb55)
   **/
  6: optional string created_by
}

You could make the key something like "{some unique name prefix specific to you}.PII.columns"=a.b.c,d.e.f
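
As a concrete illustration, here is a minimal parquet-mr sketch that injects such a pair by wrapping an existing WriteSupport (the class name, the "com.example" prefix, and the key are made up for the example; WriteSupport and its WriteContext are the existing parquet-hadoop APIs):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.RecordConsumer;

/**
 * Sketch: wraps an existing WriteSupport and adds one custom entry to the
 * key_value_metadata list written into the file footer.
 */
public class PiiTaggingWriteSupport<T> extends WriteSupport<T> {
  private final WriteSupport<T> delegate;
  private final String piiColumns;  // e.g. "a.b.c,d.e.f"

  public PiiTaggingWriteSupport(WriteSupport<T> delegate, String piiColumns) {
    this.delegate = delegate;
    this.piiColumns = piiColumns;
  }

  @Override
  public WriteContext init(Configuration configuration) {
    WriteContext context = delegate.init(configuration);
    // Copy the delegate's extra metadata and add the PII tag under our own prefix.
    Map<String, String> extra = new HashMap<String, String>(context.getExtraMetaData());
    extra.put("com.example.pii.columns", piiColumns);
    return new WriteContext(context.getSchema(), extra);
  }

  @Override
  public void prepareForWrite(RecordConsumer recordConsumer) {
    delegate.prepareForWrite(recordConsumer);
  }

  @Override
  public void write(T record) {
    delegate.write(record);
  }
}

A writer built with this WriteSupport then carries the PII list in the footer's key_value_metadata for downstream tools to read.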


> On Jun 30, 2016, at 10:44 AM, Mohammad Islam wrote:
> 
> Hi All,
> What is the best way to tag a field's schema with metadata? Does Parquet 
> support it? I think Avro has a "doc" attribute, and Hive schemas have "comments".
> I need to tag each field as PII or not, and someone may also want to add a 
> description of a field.
> Regards, Mohammad
> 



Re: Supporting attribute in Parquet schema

2016-06-30 Thread Nong Li
Columns have support for key/value pairs in the metadata:

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L489

Let me know if that works for you.
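
Whichever place the pairs end up, consumers read them back from the footer. A minimal read-side sketch for the file-level map with parquet-mr (ParquetFileReader.readFooter and getKeyValueMetaData are existing parquet-hadoop calls; the key is the hypothetical one from the previous message):

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ReadPiiTag {
  public static void main(String[] args) throws IOException {
    // Read only the footer; no row group data is scanned.
    ParquetMetadata footer =
        ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));

    // File-level key/value pairs written by the producer.
    Map<String, String> keyValues = footer.getFileMetaData().getKeyValueMetaData();
    System.out.println("PII columns: " + keyValues.get("com.example.pii.columns"));
  }
}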

On Thu, Jun 30, 2016 at 10:44 AM, Mohammad Islam wrote:

> Hi All,
> What is the best way to tag a field's schema with metadata? Does Parquet
> support it? I think Avro has a "doc" attribute, and Hive schemas have "comments".
> I need to tag each field as PII or not, and someone may also want to add a
> description of a field.
> Regards, Mohammad
>
>


Supporting attribute in Parquet schema

2016-06-30 Thread Mohammad Islam
Hi All,
What is the best way to tag a field's schema with metadata? Does Parquet 
support it? I think Avro has a "doc" attribute, and Hive schemas have "comments".
I need to tag each field as PII or not, and someone may also want to add a 
description of a field.
Regards, Mohammad



[jira] [Updated] (PARQUET-612) Add compression to FileEncodingIT tests

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-612:
--
Assignee: Ryan Blue

> Add compression to FileEncodingIT tests
> ---
>
> Key: PARQUET-612
> URL: https://issues.apache.org/jira/browse/PARQUET-612
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> The {{FileEncodingsIT}} test validates that pages can be read independently 
> with all encodings, but only without compression. Pages should not depend on 
> one another when compression is enabled either, so we should extend this test 
> to cover the other compression codecs.
> This test is already expensive, so I propose adding an environment variable 
> that enables the additional codecs. That way the default build/test time does 
> not increase, but we can turn on the extra validation in Travis CI.
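
One possible shape for the proposed switch (illustrative only; the PARQUET_TEST_CODECS variable name is made up, while CompressionCodecName is the existing parquet-mr enum):

{code}
import java.util.ArrayList;
import java.util.List;

import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class TestCodecs {
  /**
   * Returns the codecs to exercise: UNCOMPRESSED by default, plus any codecs
   * listed in an (illustrative) PARQUET_TEST_CODECS environment variable,
   * e.g. PARQUET_TEST_CODECS=GZIP,SNAPPY set in the Travis CI configuration.
   */
  public static List<CompressionCodecName> codecsUnderTest() {
    List<CompressionCodecName> codecs = new ArrayList<CompressionCodecName>();
    codecs.add(CompressionCodecName.UNCOMPRESSED);
    String extra = System.getenv("PARQUET_TEST_CODECS");
    if (extra != null && !extra.isEmpty()) {
      for (String name : extra.split(",")) {
        codecs.add(CompressionCodecName.valueOf(name.trim().toUpperCase()));
      }
    }
    return codecs;
  }
}
{code}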



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-642) Improve performance of ByteBuffer based read / write paths

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-642.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 347
[https://github.com/apache/parquet-mr/pull/347]

> Improve performance of ByteBuffer based read / write paths
> --
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
>  Issue Type: Bug
>Reporter: Piyush Narang
>Assignee: Piyush Narang
> Fix For: 1.9.0
>
>
> While trying out the newest Parquet version, we noticed that the changes to 
> start using ByteBuffers:
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> and
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
> (mostly Avro, but a couple of ByteBuffer changes) caused our jobs to slow 
> down a bit.
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (in MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary class
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
> - toStringUsingUTF8() - for reads
> - encodeUTF8() - for writes
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset+length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer, would have to slice it
>   // which creates another ByteBuffer object or do what is done here to adjust the
>   // limit/offset and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
> Some micro / macro benchmarks suggest that switching to the String class for 
> the encoding / decoding improves performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported");
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset+length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer, would have to slice it
>     // which creates another ByteBuffer object or do what is done here to adjust the
>     // limit/offset and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}
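
For reproducing the comparison outside the real write path, a rough standalone timing harness (plain loops rather than a proper JMH benchmark; the sample string, iteration count, and class name are arbitrary, and results will vary by JVM and data):

{code}
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public class Utf8CodecBench {
  private static final Charset UTF_8 = Charset.forName("UTF-8");

  public static void main(String[] args) throws Exception {
    String sample = "some reasonably sized unicode string \u00e9\u00e8\u00ea";
    int iterations = 2000000;

    // Path 1: nio CharsetEncoder/CharsetDecoder, reused per call as the
    // current Binary class does with its thread-local ENCODER/UTF8.
    CharsetEncoder encoder = UTF_8.newEncoder();
    CharsetDecoder decoder = UTF_8.newDecoder();
    long t0 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      ByteBuffer bytes = encoder.encode(CharBuffer.wrap(sample));
      decoder.decode(bytes).toString();
    }
    long nioNanos = System.nanoTime() - t0;

    // Path 2: String-based conversion, as in the proposed change.
    long t1 = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      byte[] bytes = sample.getBytes("UTF-8");
      new String(bytes, "UTF-8");
    }
    long stringNanos = System.nanoTime() - t1;

    System.out.printf("nio charset: %d ms, String: %d ms%n",
        nioNanos / 1000000, stringNanos / 1000000);
  }
}
{code}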



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PARQUET-642) Improve performance of ByteBuffer based read / write paths

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem updated PARQUET-642:
--
Assignee: Piyush Narang

> Improve performance of ByteBuffer based read / write paths
> --
>
> Key: PARQUET-642
> URL: https://issues.apache.org/jira/browse/PARQUET-642
> Project: Parquet
>  Issue Type: Bug
>Reporter: Piyush Narang
>Assignee: Piyush Narang
> Fix For: 1.9.0
>
>
> While trying out the newest Parquet version, we noticed that the changes to
> start using ByteBuffers:
> https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
> and
> https://github.com/apache/parquet-mr/commit/6b24a1d1b5e2792a7821ad172a45e38d2b04f9b8
> (mostly Avro, but a couple of ByteBuffer changes) caused our jobs to slow
> down a bit.
> Read overhead: 4-6% (in MB_Millis)
> Write overhead: 6-10% (in MB_Millis)
> This seems to be due to the encoding / decoding of Strings in the Binary class
> (https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java):
> - toStringUsingUTF8() - for reads
> - encodeUTF8() - for writes
> In those methods we're using the nio Charsets for encode / decode:
> {code}
> private static ByteBuffer encodeUTF8(CharSequence value) {
>   try {
>     return ENCODER.get().encode(CharBuffer.wrap(value));
>   } catch (CharacterCodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> ...
> @Override
> public String toStringUsingUTF8() {
>   int limit = value.limit();
>   value.limit(offset+length);
>   int position = value.position();
>   value.position(offset);
>   // no corresponding interface to read a subset of a buffer, would have to slice it
>   // which creates another ByteBuffer object or do what is done here to adjust the
>   // limit/offset and set them back after
>   String ret = UTF8.decode(value).toString();
>   value.limit(limit);
>   value.position(position);
>   return ret;
> }
> {code}
> Some micro / macro benchmarks suggest that switching to the String class for
> the encoding / decoding improves performance:
> {code}
> @Override
> public String toStringUsingUTF8() {
>   String ret;
>   if (value.hasArray()) {
>     try {
>       ret = new String(value.array(), value.arrayOffset() + offset, length, "UTF-8");
>     } catch (UnsupportedEncodingException e) {
>       throw new ParquetDecodingException("UTF-8 not supported");
>     }
>   } else {
>     int limit = value.limit();
>     value.limit(offset+length);
>     int position = value.position();
>     value.position(offset);
>     // no corresponding interface to read a subset of a buffer, would have to slice it
>     // which creates another ByteBuffer object or do what is done here to adjust the
>     // limit/offset and set them back after
>     ret = UTF8.decode(value).toString();
>     value.limit(limit);
>     value.position(position);
>   }
>   return ret;
> }
> ...
> private static ByteBuffer encodeUTF8(String value) {
>   try {
>     return ByteBuffer.wrap(value.getBytes("UTF-8"));
>   } catch (UnsupportedEncodingException e) {
>     throw new ParquetEncodingException("UTF-8 not supported.", e);
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-645) DictionaryFilter incorrectly handles null

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-645.
---
Resolution: Fixed

Issue resolved by pull request 348
[https://github.com/apache/parquet-mr/pull/348]

> DictionaryFilter incorrectly handles null
> -
>
> Key: PARQUET-645
> URL: https://issues.apache.org/jira/browse/PARQUET-645
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.9.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: 1.9.0
>
>
> DictionaryFilter checks whether a column can match a query and filters out 
> row groups that can't match. Equality checks don't currently handle null 
> correctly: null never appears in the dictionary, because it is encoded by the 
> definition level rather than stored as a value. This causes row groups to be 
> filtered out when they should not be, because a "col is null" predicate is 
> treated as if it could never match.
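
An illustrative sketch of the guard such a check needs (the class and method are made up, not the actual PR 348 patch): on a null comparison the dictionary alone can never justify dropping a row group.

{code}
import java.util.Set;

public final class DictionaryEqCheck {
  private DictionaryEqCheck() {}

  /**
   * Returns true only when the row group provably cannot satisfy "col == value".
   * Assumes every data page of the column chunk is dictionary encoded.
   */
  public static <T> boolean canDrop(T value, Set<T> dictionaryValues, boolean columnHasNulls) {
    if (value == null) {
      // Nulls never appear in the dictionary; they are recorded via definition
      // levels, so the dictionary cannot disprove "col is null". Only drop the
      // row group when the column is known to contain no nulls at all.
      return !columnHasNulls;
    }
    return !dictionaryValues.contains(value);
  }
}
{code}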



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PARQUET-544) ParquetWriter.close() throws NullPointerException on second call, improper implementation of Closeable contract

2016-06-30 Thread Julien Le Dem (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Le Dem resolved PARQUET-544.
---
   Resolution: Fixed
Fix Version/s: 1.9.0

Issue resolved by pull request 345
[https://github.com/apache/parquet-mr/pull/345]

> ParquetWriter.close() throws NullPointerException on second call, improper 
> implementation of Closeable contract
> ---
>
> Key: PARQUET-544
> URL: https://issues.apache.org/jira/browse/PARQUET-544
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.1
>Reporter: Michal Turek
>Assignee: Michal Turek
>Priority: Minor
> Fix For: 1.9.0
>
>
> {{org.apache.parquet.hadoop.ParquetWriter}} implements 
> {{java.util.Closeable}}, but its {{close()}} method doesn't follow the 
> interface contract properly. The contract states "If the stream is already 
> closed then invoking this method has no effect.", but {{ParquetWriter}} 
> instead throws a {{NullPointerException}}.
> Its source is quite obvious: {{columnStore}} is set to null and then 
> accessed again, and there is no "if already closed" check to prevent it.
> {noformat}
> java.lang.NullPointerException: null
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:157) ~[parquet-hadoop-1.8.1.jar:1.8.1]
>   at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113) ~[parquet-hadoop-1.8.1.jar:1.8.1]
>   at org.apache.parquet.hadoop.ParquetWriter.close(ParquetWriter.java:297) ~[parquet-hadoop-1.8.1.jar:1.8.1]
> {noformat}
> {noformat}
>   private void flushRowGroupToStore()
>       throws IOException {
>     LOG.info(format("Flushing mem columnStore to file. allocated memory: %,d", columnStore.getAllocatedSize()));
>     if (columnStore.getAllocatedSize() > (3 * rowGroupSizeThreshold)) {
>       LOG.warn("Too much memory used: " + columnStore.memUsageString());
>     }
>     if (recordCount > 0) {
>       parquetFileWriter.startBlock(recordCount);
>       columnStore.flush();
>       pageStore.flushToFileWriter(parquetFileWriter);
>       recordCount = 0;
>       parquetFileWriter.endBlock();
>       this.nextRowGroupSize = Math.min(
>           parquetFileWriter.getNextRowGroupSize(),
>           rowGroupSizeThreshold);
>     }
>     columnStore = null;
>     pageStore = null;
>   }
> {noformat}
> A known workaround is to explicitly prevent the second and subsequent closes 
> in the application code.
> {noformat}
> private final ParquetWriter writer;
> private boolean closed;
>
> private void closeWriterOnlyOnce() throws IOException {
>     if (!closed) {
>         closed = true;
>         writer.close();
>     }
> }
> {noformat}
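
The writer-side direction for the fix, sketched here only in outline (not the actual pull request 345 change), is to make close() itself idempotent, as the Closeable contract requires:

{code}
import java.io.Closeable;
import java.io.IOException;

public class IdempotentWriter implements Closeable {
  private boolean closed;

  @Override
  public void close() throws IOException {
    if (closed) {
      return;  // Closeable contract: closing an already-closed resource is a no-op
    }
    closed = true;
    flushAndRelease();  // flush buffered state exactly once, then drop references
  }

  private void flushAndRelease() throws IOException {
    // placeholder for the real flush / column-store teardown
  }
}
{code}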



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)