Re: String interning in parquet-format

2018-04-03 Thread Julien Le Dem
The main reason for the string interning is to save memory. Some of the
early parquet design uses the column names in the metadata to refer to
columns. When deserializing metadata we get a new string instance each
time, even though it is the same string. We don't need to rely on the
interning mechanism; it was just convenient for deduplicating the strings.
If there's a better mechanism to do that, then it is fine to replace
interning with it.
Cheers
Julien
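Julien's suggestion above (deduplication without relying on String.intern) could be sketched as a small cache owned by the metadata reader. This is hypothetical illustration code, not parquet-mr's actual implementation; the class and method names are made up:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical replacement for String.intern(): a plain map owned by the
// reader, so canonical strings are collected together with the reader
// instead of living in the JVM-wide intern table.
class StringDedupCache {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    // Returns one canonical instance for any string equal to s.
    String dedup(String s) {
        String prev = cache.putIfAbsent(s, s);
        return prev != null ? prev : s;
    }

    public static void main(String[] args) {
        StringDedupCache c = new StringDedupCache();
        String a = c.dedup(new String("column_a".toCharArray()));
        String b = c.dedup(new String("column_a".toCharArray()));
        // equal strings now share one instance, without String.intern()
        System.out.println(a == b); // true
    }
}
```

Unlike intern(), entries here become unreachable together with the cache itself, so nothing accumulates in a process-wide table.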

On Tue, Apr 3, 2018 at 12:15 PM, Robert Kruszewski wrote:

> I have been pointed to https://github.com/apache/parquet-format/pull/2
> which is the original PR for PARQUET-11. Looking at
> http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp#l2563
> and
> http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp#l5261
> it does look like interned strings are very rarely GC'ed.
>
> On Tue, 3 Apr 2018 at 18:45 Robert Kruszewski  wrote:
>
> > Hi parquet-dev,
> >
> > I wanted to start a discussion around the existence of string interning
> > in the thrift protocol in parquet-format. I posted some links in
> > https://issues.apache.org/jira/browse/PARQUET-1261 and, while I haven't
> > done perf benchmarking, I have previously seen interned strings cause GC
> > overhead limit exceeded exceptions. The only reference I could find for
> > why this was added is https://issues.apache.org/jira/browse/PARQUET-11,
> > which unfortunately leads to a deleted repo. Does anyone remember the
> > exact details?
> >
> > If we deem string deduplication there to be necessary, we should
> > investigate implementing a simple cache instead. I'd hope we can simply
> > get rid of interning without much harm but would love to hear others'
> > opinions.
> >
> > Robert
> >
>


String interning in parquet-format

2018-04-03 Thread Robert Kruszewski
Hi parquet-dev,

I wanted to start a discussion around the existence of string interning in
the thrift protocol in parquet-format. I posted some links in
https://issues.apache.org/jira/browse/PARQUET-1261 and, while I haven't
done perf benchmarking, I have previously seen interned strings cause GC
overhead limit exceeded exceptions. The only reference I could find for why
this was added is https://issues.apache.org/jira/browse/PARQUET-11, which
unfortunately leads to a deleted repo. Does anyone remember the exact
details?

If we deem string deduplication there to be necessary, we should
investigate implementing a simple cache instead. I'd hope we can simply get
rid of interning without much harm but would love to hear others' opinions.

Robert


[jira] [Assigned] (PARQUET-1261) Parquet-format interns strings when reading filemetadata

2018-04-03 Thread Gabor Szadovszky (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1261:
-

Assignee: Robert Kruszewski

> Parquet-format interns strings when reading filemetadata
> 
>
> Key: PARQUET-1261
> URL: https://issues.apache.org/jira/browse/PARQUET-1261
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Robert Kruszewski
>Assignee: Robert Kruszewski
>Priority: Major
>
> Parquet-format interns strings when deserializing metadata. The references I 
> could find suggested that this had been done to reduce memory pressure early 
> on. Java (and the JVM in particular) has come a long way since then, and 
> interning is generally discouraged; see 
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good 
> explanation. What is more, since Java 8 there's string deduplication 
> implemented at the GC level, per [http://openjdk.java.net/jeps/192]. During our 
> usage and testing we found the interning to cause significant GC pressure for 
> long-running applications due to a bigger GC root set.
> This issue proposes removing interning, given it's questionable whether it 
> should be used in modern JVMs.
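The concern described above can be demonstrated directly: deserializing the same metadata twice yields equal but distinct String instances, and String.intern() collapses them at the cost of a canonical copy referenced from the JVM-wide intern table. A minimal sketch (illustration only, not parquet-mr code):

```java
class InternDemo {
    // Builds a fresh String instance, as a Thrift deserializer would.
    static String freshCopy(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String a = freshCopy("column_a");
        String b = freshCopy("column_a");
        System.out.println(a.equals(b)); // true: same content
        System.out.println(a == b);      // false: separate objects on the heap
        // intern() collapses them to one canonical copy, but that copy now
        // lives in the JVM-wide intern table (the root-set concern above).
        System.out.println(a.intern() == b.intern()); // true
    }
}
```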



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424016#comment-16424016
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r178819326
 
 

 ##
 File path: parquet-column/src/main/java/org/apache/parquet/schema/Types.java
 ##
 @@ -252,7 +252,12 @@ protected final THIS repetition(Type.Repetition repetition) {
   * @return this builder for method chaining
   */
public THIS as(OriginalType type) {
-  this.originalType = type;
+  this.logicalTypeAnnotation = LogicalTypeAnnotation.fromOriginalType(type);
+  return self();
+}
+
+public THIS as(LogicalTypeAnnotation type) {
 
 Review comment:
   This method breaks the fluent API of the builder, as you need to use an 
outside factory method of the final type to create a LogicalTypeAnnotation. If 
it is practically feasible, I would suggest refactoring the 
LogicalTypeAnnotation API to fit more into the fluent API of Types.
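The fluency concern can be illustrated with a toy builder (purely hypothetical; this is not the real Types/LogicalTypeAnnotation API): an as(...) overload that takes an externally constructed object interrupts the call chain, while a builder-owned shortcut keeps it fluent.

```java
// Hypothetical miniature of the builder shape under discussion; the real
// parquet-mr classes are richer than this.
class FluentSketch {
    interface Annotation { String name(); }

    static class FieldBuilder {
        private Annotation annotation;
        // Breaks fluency: the caller must construct the argument elsewhere.
        FieldBuilder as(Annotation a) { this.annotation = a; return this; }
        // Fluent alternative: the builder exposes a named shortcut itself.
        FieldBuilder asString() { return as(() -> "STRING"); }
        String describe() { return annotation == null ? "none" : annotation.name(); }
    }

    public static void main(String[] args) {
        // Non-fluent: an outside factory call interrupts the chain.
        String a = new FieldBuilder().as(() -> "STRING").describe();
        // Fluent: stays inside the builder's own vocabulary.
        String b = new FieldBuilder().asString().describe();
        System.out.println(a + " " + b); // STRING STRING
    }
}
```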


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support for new logical type representation
> ---
>
> Key: PARQUET-1253
> URL: https://issues.apache.org/jira/browse/PARQUET-1253
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Major
>
> Latest parquet-format 
> [introduced|https://github.com/apache/parquet-format/commit/863875e0be3237c6aa4ed71733d54c91a51deabe#diff-0f9d1b5347959e15259da7ba8f4b6252]
>  a new representation for logical types. As of now this is not yet supported 
> in parquet-mr, thus there's no way to use parametrized UTC normalized 
> timestamp data types. When reading and writing Parquet files, besides 
> 'converted_type' parquet-mr should use the new 'logicalType' field in 
> SchemaElement to tell the current logical type annotation. To maintain 
> backward compatibility, the semantic of converted_type shouldn't change.





[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424020#comment-16424020
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r178810222
 
 

 ##
 File path: 
parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java
 ##
 @@ -0,0 +1,748 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.schema;
+
+import org.apache.parquet.format.BsonType;
+import org.apache.parquet.format.ConvertedType;
+import org.apache.parquet.format.DateType;
+import org.apache.parquet.format.DecimalType;
+import org.apache.parquet.format.EnumType;
+import org.apache.parquet.format.IntType;
+import org.apache.parquet.format.JsonType;
+import org.apache.parquet.format.ListType;
+import org.apache.parquet.format.LogicalType;
+import org.apache.parquet.format.MapType;
+import org.apache.parquet.format.MicroSeconds;
+import org.apache.parquet.format.MilliSeconds;
+import org.apache.parquet.format.NullType;
+import org.apache.parquet.format.StringType;
+import org.apache.parquet.format.TimeType;
+import org.apache.parquet.format.TimestampType;
+
+import java.util.Objects;
+
+public interface LogicalTypeAnnotation {
+  /**
+   * Convert this parquet-mr logical type to parquet-format LogicalType.
+   *
+   * @return the parquet-format LogicalType representation of this logical type implementation
+   */
+  LogicalType toLogicalType();
+
+  /**
+   * Convert this parquet-mr logical type to parquet-format ConvertedType.
+   *
+   * @return the parquet-format ConvertedType representation of this logical type implementation
+   */
+  ConvertedType toConvertedType();
+
+  /**
+   * Convert this logical type to old logical type representation in parquet-mr (if there's any).
+   * Those logical type implementations, which don't have a corresponding mapping should return null.
+   *
+   * @return the OriginalType representation of the new logical type, or null if there's none
+   */
+  OriginalType toOriginalType();
+
+  /**
+   * Helper method to convert the old representation of logical types (OriginalType) to new logical type.
+   */
+  static LogicalTypeAnnotation fromOriginalType(OriginalType originalType) {
+if (originalType == null) {
+  return null;
+}
+switch (originalType) {
+  case UTF8:
+return StringLogicalTypeAnnotation.create();
+  case MAP:
+return MapLogicalTypeAnnotation.create();
+  case DECIMAL:
+return DecimalLogicalTypeAnnotation.create();
+  case LIST:
+return ListLogicalTypeAnnotation.create();
+  case DATE:
+return DateLogicalTypeAnnotation.create();
+  case INTERVAL:
+return IntervalLogicalTypeAnnotation.create();
+  case TIMESTAMP_MILLIS:
+return TimestampLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
+  case TIMESTAMP_MICROS:
+return TimestampLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MICROS);
+  case TIME_MILLIS:
+return TimeLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
+  case TIME_MICROS:
+return TimeLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MICROS);
+  case UINT_8:
+return IntLogicalTypeAnnotation.create((byte) 8, false);
+  case UINT_16:
+return IntLogicalTypeAnnotation.create((byte) 16, false);
+  case UINT_32:
+return IntLogicalTypeAnnotation.create((byte) 32, false);
+  case UINT_64:
+return IntLogicalTypeAnnotation.create((byte) 64, false);
+  case INT_8:
+return IntLogicalTypeAnnotation.create((byte) 8, true);
+  case INT_16:
+return IntLogicalTypeAnnotation.create((byte) 16, true);
+  case INT_32:
+return IntLogicalTypeAnnotation.create((byte) 32, true);
+  case INT_64:
+return IntLogicalTypeAnnotation.create((byte) 64, true);
+  case ENUM:
+  

[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424018#comment-16424018
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r178809713
 
 


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424015#comment-16424015
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r178799024
 
 


[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424017#comment-16424017
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r178821942
 
 

 ##
 File path: 
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java
 ##
 @@ -41,6 +40,10 @@
 
   private static final ObjectMapper objectMapper = new ObjectMapper();
 
+  static {
+objectMapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
 
 Review comment:
   Could you please explain why it is necessary?









[jira] [Commented] (PARQUET-1253) Support for new logical type representation

2018-04-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424019#comment-16424019
 ] 

ASF GitHub Bot commented on PARQUET-1253:
-

gszadovszky commented on a change in pull request #463: PARQUET-1253: Support 
for new logical type representation
URL: https://github.com/apache/parquet-mr/pull/463#discussion_r178821039
 
 

 ##
 File path: 
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
 ##
 @@ -586,108 +595,105 @@ Type getType(PrimitiveTypeName type) {
   }
 
   // Visible for testing
-  OriginalType getOriginalType(ConvertedType type) {
+  LogicalTypeAnnotation getOriginalType(ConvertedType type, SchemaElement schemaElement) {
 switch (type) {
   case UTF8:
-return OriginalType.UTF8;
+return LogicalTypeAnnotation.StringLogicalTypeAnnotation.create();
   case MAP:
-return OriginalType.MAP;
+return LogicalTypeAnnotation.MapLogicalTypeAnnotation.create();
   case MAP_KEY_VALUE:
-return OriginalType.MAP_KEY_VALUE;
+return LogicalTypeAnnotation.MapKeyValueTypeAnnotation.create();
   case LIST:
-return OriginalType.LIST;
+return LogicalTypeAnnotation.ListLogicalTypeAnnotation.create();
   case ENUM:
-return OriginalType.ENUM;
+return LogicalTypeAnnotation.EnumLogicalTypeAnnotation.create();
   case DECIMAL:
-return OriginalType.DECIMAL;
+if (schemaElement == null) {
+  return LogicalTypeAnnotation.DecimalLogicalTypeAnnotation.create();
+}
+return LogicalTypeAnnotation.DecimalLogicalTypeAnnotation.create(schemaElement.scale, schemaElement.precision);
   case DATE:
-return OriginalType.DATE;
+return LogicalTypeAnnotation.DateLogicalTypeAnnotation.create();
   case TIME_MILLIS:
-return OriginalType.TIME_MILLIS;
+return LogicalTypeAnnotation.TimeLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
   case TIME_MICROS:
-return OriginalType.TIME_MICROS;
+return LogicalTypeAnnotation.TimeLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MICROS);
   case TIMESTAMP_MILLIS:
-return OriginalType.TIMESTAMP_MILLIS;
+return LogicalTypeAnnotation.TimestampLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
   case TIMESTAMP_MICROS:
-return OriginalType.TIMESTAMP_MICROS;
+return LogicalTypeAnnotation.TimestampLogicalTypeAnnotation.create(true, LogicalTypeAnnotation.TimeUnit.MICROS);
   case INTERVAL:
-return OriginalType.INTERVAL;
+return LogicalTypeAnnotation.IntervalLogicalTypeAnnotation.create();
   case INT_8:
-return OriginalType.INT_8;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 8, true);
   case INT_16:
-return OriginalType.INT_16;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 16, true);
   case INT_32:
-return OriginalType.INT_32;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 32, true);
   case INT_64:
-return OriginalType.INT_64;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 64, true);
   case UINT_8:
-return OriginalType.UINT_8;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 8, false);
   case UINT_16:
-return OriginalType.UINT_16;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 16, false);
   case UINT_32:
-return OriginalType.UINT_32;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 32, false);
   case UINT_64:
-return OriginalType.UINT_64;
+return LogicalTypeAnnotation.IntLogicalTypeAnnotation.create((byte) 64, false);
   case JSON:
-return OriginalType.JSON;
+return LogicalTypeAnnotation.JsonLogicalTypeAnnotation.create();
   case BSON:
-return OriginalType.BSON;
+return LogicalTypeAnnotation.BsonLogicalTypeAnnotation.create();
   default:
-throw new RuntimeException("Unknown converted type " + type);
+return LogicalTypeAnnotation.NullLogicalTypeAnnotation.create();
 }
   }
 
-  // Visible for testing
-  ConvertedType getConvertedType(OriginalType type) {
-switch (type) {
-  case UTF8:
-return ConvertedType.UTF8;
+  LogicalTypeAnnotation getOriginalType(LogicalType type) {
+switch (type.getSetField()) {
   case MAP:
-return ConvertedType.MAP;
-  case MAP_KEY_VALUE:
-return ConvertedType.MAP_KEY_VALUE;
-  case LIST:
-return ConvertedType.LIST;
-  case ENUM:
-return ConvertedType.ENUM;
-  case DECIMAL:
-return ConvertedType.DECIMAL;
+return 

Subject: [VOTE] Release Apache Parquet Format 2.5.0 RC0

2018-04-03 Thread Gabor Szadovszky
Hi everyone,

Zoltan and I propose the following RC to be released as official Apache Parquet 
Format 2.5.0 release.

The commit id is f0fa7c14a4699581b41d8ba9aff1512663cc0fb4
* This corresponds to the tag: apache-parquet-format-2.5.0
* https://github.com/apache/parquet-format/tree/f0fa7c14a4699581b41d8ba9aff1512663cc0fb4

The release tarball, signature, and checksums are here:
* https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.5.0-rc0/

You can find the KEYS file here:
* https://dist.apache.org/repos/dist/dev/parquet/KEYS

Binary artifacts are staged in Nexus here:
* https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/

This release includes important changes that I should have summarized here, but 
I'm lazy.
See 
https://github.com/apache/parquet-format/blob/f0fa7c14a4699581b41d8ba9aff1512663cc0fb4/CHANGES.md
for details.

Please download, verify, and test.

Please vote by 10AM on Friday, April 6, 2018 (UTC).

[ ] +1 Release this as Apache Parquet Format 2.5.0
[ ] +0
[ ] -1 Do not release this because…



Re: Solution to read/write multiple parquet files

2018-04-03 Thread Lizhou Gao
Thanks for your quick reply!
Given the scenario below: there are 200k rows of SQL data, where rows 0-100k 
contain more nulls while rows 100k-200k contain more non-null values.
If we convert the two parts into parquet files, we may get 0-100k.parquet (500M) 
and 100k-200k.parquet (1.3G). Then what is the best practice for cutting these 
rows into parquet files?
Another question: should we keep the same RowGroup size throughout one parquet 
file?

Thanks,
Lizhou
-- Original --
From:  "Uwe L. Korn";
Date:  Tue, Apr 3, 2018 04:21 PM
To:  "dev"; 

Subject:  Re: Solution to read/write multiple parquet files

 
Hello Lizhou,

on the Python side there is http://dask.pydata.org/en/latest/ that can read 
large, distributed Parquet datasets. When using `engine=pyarrow`, it also uses 
parquet-cpp under the hood.

On the pure C++ side, I know that https://github.com/thrill/thrill has 
experimental parquet support. But this is an experimental feature in an 
experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed single-digit gigabyte sizes, and 
the RowGroups inside these files should be 128MiB or less. You will be able to 
write tools that can deal with other sizes, but that will break the portability 
aspect of Parquet files a bit.
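As a back-of-the-envelope illustration of that guidance (the 2 GiB input here is a made-up example, not an additional recommendation), a file at the suggested 128 MiB row-group ceiling would be split into row groups roughly like this:

```java
// Rough sizing arithmetic: how many row groups a file of a given size
// needs if each row group is capped at a target byte size.
class SizingSketch {
    static long rowGroupsFor(long totalBytes, long targetRowGroupBytes) {
        // Ceiling division: the last row group may be smaller than the cap.
        return (totalBytes + targetRowGroupBytes - 1) / targetRowGroupBytes;
    }

    public static void main(String[] args) {
        long target = 128L * 1024 * 1024;    // 128 MiB per row group
        long file = 2L * 1024 * 1024 * 1024; // a hypothetical 2 GiB file
        System.out.println(rowGroupsFor(file, target)); // 16
    }
}
```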

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
> We have a storage engine that needs to manage a large set of data (PB 
> level). Currently we store it as a single parquet file. After some 
> searching, it seems the data should be cut into multiple parquet files 
> for further reading/writing/managing. But I don't know whether there is 
> already an open-source solution to read/write/manage multiple parquet 
> files. Our programming language is C++.
> Any comments/suggestions are welcome. Thanks!
> 
> 
> Regards,
> Lizhou

[jira] [Commented] (PARQUET-1248) java.lang.UnsupportedOperationException: Unimplemented type: StringType

2018-04-03 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423711#comment-16423711
 ] 

Nandor Kollar commented on PARQUET-1248:


[~modi.shrutika] based on the information you provided, I think this error is 
related to the vectorized Parquet reader in Spark, and not a Parquet issue. If you 
take a look at the Spark code, there's a TODO in the 
[readIntBatch|https://github.com/apache/spark/blame/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L385]
 method; maybe the error you got is related to it (however, it looks like it tries 
to read ints while you have strings)? I'm not familiar with Spark, but maybe you 
have a similar problem to SPARK-23822.

>  java.lang.UnsupportedOperationException: Unimplemented type: StringType
> 
>
> Key: PARQUET-1248
> URL: https://issues.apache.org/jira/browse/PARQUET-1248
> Project: Parquet
>  Issue Type: Bug
>Reporter: Shrutika modi
>Priority: Major
>
> java.lang.UnsupportedOperationException: Unimplemented type: StringType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:356)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:183)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:230)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
>  at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:133)
>  at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:166)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
>  Source)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
>  ... 8 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Solution to read/write multiple parquet files

2018-04-03 Thread Uwe L. Korn
Hello Lizhou,

on the Python side there is Dask (http://dask.pydata.org/en/latest/), which 
can read large, distributed Parquet datasets. When using `engine=pyarrow`, it 
also uses parquet-cpp under the hood.
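A quick sketch of the multi-file "dataset" convention such tools follow: a dataset is simply a directory of part files, discovered by globbing and processed in a stable order. The file names below (part-00000.parquet, ...) are an illustrative convention, not a specification.

```python
# Sketch: treat a directory of part files as one dataset. Empty stand-in
# files are created here; real part files would hold row groups.
import glob
import os
import tempfile

def list_parts(dataset_dir, suffix=".parquet"):
    """Return the dataset's part files in a stable, sorted order."""
    return sorted(glob.glob(os.path.join(dataset_dir, "*" + suffix)))

with tempfile.TemporaryDirectory() as d:
    for i in range(3):
        open(os.path.join(d, f"part-{i:05d}.parquet"), "wb").close()
    print([os.path.basename(p) for p in list_parts(d)])
    # prints: ['part-00000.parquet', 'part-00001.parquet', 'part-00002.parquet']
```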

On the pure C++ side, I know that https://github.com/thrill/thrill has 
experimental parquet support. But this is an experimental feature in an 
experimental framework, so be careful about relying on it.

In general, Parquet files should not exceed the single-digit-gigabyte range, 
and the row groups inside these files should be 128 MiB or less. You will be 
able to write tools that can deal with other sizes, but that will somewhat 
compromise the portability of Parquet files.
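The sizing guidance above reduces to back-of-the-envelope arithmetic. The targets below (4 GiB files, 128 MiB row groups) are illustrative assumptions within the stated ranges, not fixed limits.

```python
# Rough layout plan for a large dataset under the sizing guidance:
# single-digit-GiB files, each made of <=128 MiB row groups.

def plan_layout(total_bytes, target_file_bytes=4 * 2**30,
                row_group_bytes=128 * 2**20):
    """Return (number of files, row groups per file) for a dataset size."""
    num_files = -(-total_bytes // target_file_bytes)             # ceiling division
    row_groups_per_file = -(-target_file_bytes // row_group_bytes)
    return num_files, row_groups_per_file

# Example: 1 PiB split into ~4 GiB files of 128 MiB row groups.
files, groups = plan_layout(2**50)
print(files, groups)  # prints: 262144 32
```

At PB scale this lands at hundreds of thousands of files, which is why a dataset layer (directory structure plus discovery) on top of individual Parquet files becomes necessary.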

Uwe

On Tue, Apr 3, 2018, at 10:00 AM, 高立周 wrote:
> Hi experts,
>We have a storage engine that needs to manage a large set of data (PB 
> level). Currently we store it as a single Parquet file. After some 
> searching, it seems the data should be split into multiple Parquet files 
> for easier reading/writing/management. But I don't know whether there is 
> already an open-source solution to read/write/manage multiple Parquet 
> files. Our programming language is C++.
>   Any comments/suggestions are welcome. Thanks!
> 
> 
> Regards,
> Lizhou


Solution to read/write multiple parquet files

2018-04-03 Thread 高立周
Hi experts,
   We have a storage engine that needs to manage a large set of data (PB level). 
Currently we store it as a single Parquet file. After some searching, it seems 
the data should be split into multiple Parquet files for easier 
reading/writing/management. But I don't know whether there is already an 
open-source solution to read/write/manage multiple Parquet files. Our 
programming language is C++.
  Any comments/suggestions are welcome. Thanks!


Regards,
Lizhou

[jira] [Updated] (PARQUET-1025) Support new min-max statistics in parquet-mr

2018-04-03 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1025:
---
Fix Version/s: (was: 1.11)
   1.10.0

> Support new min-max statistics in parquet-mr
> 
>
> Key: PARQUET-1025
> URL: https://issues.apache.org/jira/browse/PARQUET-1025
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.9.1
>Reporter: Zoltan Ivanfi
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.10.0
>
>
> Impala started using new min-max statistics that got specified as part of 
> PARQUET-686. Support for these should be added to parquet-mr as well.





[jira] [Updated] (PARQUET-1025) Support new min-max statistics in parquet-mr

2018-04-03 Thread Zoltan Ivanfi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi updated PARQUET-1025:
---
Fix Version/s: (was: 1.10.0)
   1.11

> Support new min-max statistics in parquet-mr
> 
>
> Key: PARQUET-1025
> URL: https://issues.apache.org/jira/browse/PARQUET-1025
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.9.1
>Reporter: Zoltan Ivanfi
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11
>
>
> Impala started using new min-max statistics that got specified as part of 
> PARQUET-686. Support for these should be added to parquet-mr as well.


