[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410584#comment-16410584
 ] 

ASF GitHub Bot commented on PARQUET-1245:
------------------------------------------

wesm commented on issue #447: PARQUET-1245: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-375495170
 
 
   @cpcloud @pitrou we may want to bump parquet-cpp in conda-forge to pick up 
this patch at some point


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Segfault when writing Arrow table with duplicate columns
> --------------------------------------------------------------
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
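As a general form of the workaround above (a sketch only: the file path is a placeholder, and the duplicate indices are discovered from the schema instead of being hardcoded as 34):

{code:python}
# Sketch: drop all but the first occurrence of each duplicated column name.
import collections

import pyarrow.parquet as pq

table = pq.read_table('/path/to/duplicate_column_file.parquet')  # placeholder path

# Find column names that occur more than once in the schema.
names = [field.name for field in table.schema]
duplicated = {name for name, n in collections.Counter(names).items() if n > 1}

# Remove later occurrences, iterating from the end so earlier indices stay
# valid while columns are removed; remove_column returns a new table.
for i in reversed(range(len(names))):
    if names[i] in duplicated and names.index(names[i]) != i:
        table = table.remove_column(i)

table.to_pandas()  # OK once the duplicates are gone
{code}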



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410582#comment-16410582
 ] 

ASF GitHub Bot commented on PARQUET-1245:
------------------------------------------

wesm closed pull request #447: PARQUET-1245: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc b/src/parquet/arrow/arrow-reader-writer-test.cc
index 72e65d47..f06f4a87 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -1669,6 +1669,27 @@ TEST(TestArrowReadWrite, TableWithChunkedColumns) {
   }
 }
 
+TEST(TestArrowReadWrite, TableWithDuplicateColumns) {
+  // See ARROW-1974
+  using ::arrow::ArrayFromVector;
+
+  auto f0 = field("duplicate", ::arrow::int8());
+  auto f1 = field("duplicate", ::arrow::int16());
+  auto schema = ::arrow::schema({f0, f1});
+
+  std::vector<int8_t> a0_values = {1, 2, 3};
+  std::vector<int16_t> a1_values = {14, 15, 16};
+
+  std::shared_ptr<Array> a0, a1;
+
+  ArrayFromVector<::arrow::Int8Type, int8_t>(a0_values, &a0);
+  ArrayFromVector<::arrow::Int16Type, int16_t>(a1_values, &a1);
+
+  auto table = Table::Make(schema, {std::make_shared<Column>(f0->name(), a0),
+                                    std::make_shared<Column>(f1->name(), a1)});
+  CheckSimpleRoundtrip(table, table->num_rows());
+}
+
 TEST(TestArrowWrite, CheckChunkSize) {
   const int num_columns = 2;
   const int num_rows = 128;
diff --git a/src/parquet/arrow/arrow-schema-test.cc b/src/parquet/arrow/arrow-schema-test.cc
index d502d243..da6af528 100644
--- a/src/parquet/arrow/arrow-schema-test.cc
+++ b/src/parquet/arrow/arrow-schema-test.cc
@@ -165,6 +165,31 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) {
   CheckFlatSchema(arrow_schema);
 }
 
+TEST_F(TestConvertParquetSchema, DuplicateFieldNames) {
+  std::vector<NodePtr> parquet_fields;
+  std::vector<std::shared_ptr<Field>> arrow_fields;
+
+  parquet_fields.push_back(
+      PrimitiveNode::Make("xxx", Repetition::REQUIRED, ParquetType::BOOLEAN));
+  auto arrow_field1 = std::make_shared<Field>("xxx", BOOL, false);
+
+  parquet_fields.push_back(
+      PrimitiveNode::Make("xxx", Repetition::REQUIRED, ParquetType::INT32));
+  auto arrow_field2 = std::make_shared<Field>("xxx", INT32, false);
+
+  ASSERT_OK(ConvertSchema(parquet_fields));
+  arrow_fields = {arrow_field1, arrow_field2};
+  CheckFlatSchema(std::make_shared<::arrow::Schema>(arrow_fields));
+
+  ASSERT_OK(ConvertSchema(parquet_fields, std::vector<int>({0, 1})));
+  arrow_fields = {arrow_field1, arrow_field2};
+  CheckFlatSchema(std::make_shared<::arrow::Schema>(arrow_fields));
+
+  ASSERT_OK(ConvertSchema(parquet_fields, std::vector<int>({1, 0})));
+  arrow_fields = {arrow_field2, arrow_field1};
+  CheckFlatSchema(std::make_shared<::arrow::Schema>(arrow_fields));
+}
+
 TEST_F(TestConvertParquetSchema, ParquetKeyValueMetadata) {
   std::vector<NodePtr> parquet_fields;
   std::vector<std::shared_ptr<Field>> arrow_fields;
diff --git a/src/parquet/arrow/reader.cc b/src/parquet/arrow/reader.cc
index bd68ec32..78c3225a 100644
--- a/src/parquet/arrow/reader.cc
+++ b/src/parquet/arrow/reader.cc
@@ -443,7 +443,7 @@ Status FileReader::Impl::ReadRowGroup(int row_group_index,
 }
 
 Status FileReader::Impl::ReadTable(const std::vector<int>& indices,
-                                   std::shared_ptr<Table>* table) {
+                                   std::shared_ptr<Table>* out) {
   std::shared_ptr<::arrow::Schema> schema;
   RETURN_NOT_OK(GetSchema(indices, &schema));
 
@@ -473,7 +473,9 @@ Status FileReader::Impl::ReadTable(const std::vector<int>& indices,
     RETURN_NOT_OK(ParallelFor(nthreads, num_fields, ReadColumnFunc));
   }
 
-  *table = Table::Make(schema, columns);
+  std::shared_ptr<Table> table = Table::Make(schema, columns);
+  RETURN_NOT_OK(table->Validate());
+  *out = table;
   return Status::OK();
 }
 
diff --git a/src/parquet/schema-test.cc b/src/parquet/schema-test.cc
index c8cce9fa..ec9aff42 100644
--- a/src/parquet/schema-test.cc
+++ b/src/parquet/schema-test.cc
@@ -292,6 +292,17 @@ class TestGroupNode : public ::testing::Test {
 
     return fields;
   }
+
+  NodeVector Fields2() {
+    // Fields with a duplicate name
+    NodeVector fields;
+
+    fields.push_back(Int32("duplicate", Repetition::REQUIRED));
+    fields.push_back(Int64("unique"));
+    fields.push_back(Double("duplicate"));
+
+    return fields;
+  }
 };
 
 TEST_F(TestGroupNode, Attrs) {
@@ -346,14 +357,23 @@ TEST_F(TestGroupNode, FieldIndex) {
   GroupNode group("group", Repetition::REQUIRED, fields);
   for (size_t i = 0; i < fields.size(); i++) {
     auto field = group.field(static_cast<int>(i));
-    ASSERT_EQ(i, group.FieldIndex(*field.get()));
+

[jira] [Resolved] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-22 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved PARQUET-1245.
-----------------------------------
Resolution: Fixed

Issue resolved by pull request 447
[https://github.com/apache/parquet-cpp/pull/447]

> [C++] Segfault when writing Arrow table with duplicate columns
> --------------------------------------------------------------
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...), everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Date and time for the next Parquet sync

2018-03-22 Thread Lars Volker
Following our biweekly cadence, we should have a Parquet community sync next
week. Last time we met on a Tuesday, so this time it should be Wednesday.

I propose to meet next Wednesday, March 28th, at 6pm CET / 9am PST. Europe
switches to daylight saving time over the weekend, so we will be back to a
9-hour difference.

Please speak up if that time does not work for you.

Cheers, Lars


[jira] [Resolved] (PARQUET-323) INT96 should be marked as deprecated

2018-03-22 Thread Lars Volker (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker resolved PARQUET-323.
---------------------------------
Resolution: Fixed

> INT96 should be marked as deprecated
> ------------------------------------
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Lars Volker
>Priority: Major
>
> As discussed on the mailing list, {{INT96}} is only used to represent nanosecond 
> timestamps in Impala for historical reasons, and should be deprecated. 
> Since nanosecond precision is rarely a real requirement, one possible and simple 
> solution would be to replace {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
> Several projects (Impala, Hive, Spark, ...) support INT96.
> We need a clear spec of the replacement and the path to deprecation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PARQUET-323) INT96 should be marked as deprecated

2018-03-22 Thread Lars Volker (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated PARQUET-323:
--------------------------------
Fix Version/s: format-2.5.0

> INT96 should be marked as deprecated
> ------------------------------------
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Lars Volker
>Priority: Major
> Fix For: format-2.5.0
>
>
> As discussed on the mailing list, {{INT96}} is only used to represent nanosecond 
> timestamps in Impala for historical reasons, and should be deprecated. 
> Since nanosecond precision is rarely a real requirement, one possible and simple 
> solution would be to replace {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
> Several projects (Impala, Hive, Spark, ...) support INT96.
> We need a clear spec of the replacement and the path to deprecation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-861) Document INT96 timestamps

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410497#comment-16410497
 ] 

ASF GitHub Bot commented on PARQUET-861:
----------------------------------------

lekv closed pull request #49: PARQUET-861: Document INT96 timestamps
URL: https://github.com/apache/parquet-format/pull/49
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/LogicalTypes.md b/LogicalTypes.md
index c411dbfa..717b2903 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -144,6 +144,18 @@ example, there is no requirement that a large number of days should be
 expressed as a mix of months and days because there is not a constant
 conversion from days to months.
 
+### INT96 timestamps (also called IMPALA_TIMESTAMP)
+
+_(deprecated)_ Timestamps saved as an `int96` are made up of the nanoseconds
+in the day (first 8 bytes) and the Julian day (last 4 bytes). No timezone is
+attached to this value.
+To convert the timestamp into nanoseconds since the Unix epoch, 00:00:00.00
+on 1 January 1970, the following formula can be used:
+`(julian_day - 2440588) * (86400 * 1000 * 1000 * 1000) + nanoseconds`.
+The magic number `2440588` is the Julian day for 1 January 1970.
+
+Note that these timestamps are the common usage of the `int96` physical type
+and are not marked with a special logical type annotation.
+
 ## Embedded Types
 
 ### JSON
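As a worked example of the formula above (a hedged sketch, not part of the patch; the helper name is invented for illustration), the following decodes a raw little-endian 12-byte INT96 value into nanoseconds since the Unix epoch:

{code:python}
# Sketch of the conversion described above; int96_to_unix_nanos is our own name.
import struct

JULIAN_UNIX_EPOCH = 2440588  # Julian day of 1 January 1970
NANOS_PER_DAY = 86400 * 1000 * 1000 * 1000

def int96_to_unix_nanos(raw: bytes) -> int:
    # Little endian: first 8 bytes are the nanoseconds in the day (int64),
    # last 4 bytes are the Julian day (int32).
    nanos_in_day, julian_day = struct.unpack('<qi', raw)
    return (julian_day - JULIAN_UNIX_EPOCH) * NANOS_PER_DAY + nanos_in_day

# 1 ns past midnight on the epoch day decodes to 1 ns since the epoch.
assert int96_to_unix_nanos(struct.pack('<qi', 1, JULIAN_UNIX_EPOCH)) == 1
{code}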


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Document INT96 timestamps
> -------------------------
>
> Key: PARQUET-861
> URL: https://issues.apache.org/jira/browse/PARQUET-861
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>
> Although considered deprecated, INT96 timestamps should be documented, as the 
> format is quite special.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-323) INT96 should be marked as deprecated

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16410496#comment-16410496
 ] 

ASF GitHub Bot commented on PARQUET-323:
----------------------------------------

lekv closed pull request #86: PARQUET-323: Mark INT96 as deprecated
URL: https://github.com/apache/parquet-format/pull/86
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/Encodings.md b/Encodings.md
index 28429be7..b8905bf4 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -34,7 +34,7 @@ stores the data in the following format:
  - BOOLEAN: [Bit Packed](#RLE), LSB first
  - INT32: 4 bytes little endian
  - INT64: 8 bytes little endian
- - INT96: 12 bytes little endian
+ - INT96: 12 bytes little endian (deprecated)
  - FLOAT: 4 bytes IEEE little endian
  - DOUBLE: 8 bytes IEEE little endian
  - BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
in the array
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 195ff908..4d2e7001 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -33,7 +33,7 @@ enum Type {
   BOOLEAN = 0;
   INT32 = 1;
   INT64 = 2;
-  INT96 = 3;
+  INT96 = 3;  // deprecated, only used by legacy implementations.
   FLOAT = 4;
   DOUBLE = 5;
   BYTE_ARRAY = 6;


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> INT96 should be marked as deprecated
> ------------------------------------
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Lars Volker
>Priority: Major
>
> As discussed on the mailing list, {{INT96}} is only used to represent nanosecond 
> timestamps in Impala for historical reasons, and should be deprecated. 
> Since nanosecond precision is rarely a real requirement, one possible and simple 
> solution would be to replace {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
> Several projects (Impala, Hive, Spark, ...) support INT96.
> We need a clear spec of the replacement and the path to deprecation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PARQUET-1252) [C++] Pass BOOST_ROOT and Boost_NAMESPACE on to Thrift EP

2018-03-22 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created PARQUET-1252:
---------------------------------

 Summary: [C++] Pass BOOST_ROOT and Boost_NAMESPACE on to Thrift EP
 Key: PARQUET-1252
 URL: https://issues.apache.org/jira/browse/PARQUET-1252
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Uwe L. Korn
 Fix For: cpp-1.5.0


Currently we build {{thrift_ep}} with whatever Boost version it finds on its own. 
When {{parquet-cpp}} is built against a specific Boost version, we need to build 
Thrift with that same version as well. This requires passing {{BOOST_ROOT}} and 
{{Boost_NAMESPACE}} along to the Thrift ExternalProject.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1173) com.fasterxml.jackson.core.jackson dependency harmonization

2018-03-22 Thread Gabor Szadovszky (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409429#comment-16409429
 ] 

Gabor Szadovszky commented on PARQUET-1173:
-------------------------------------------

I don't think we should synchronize the versions of direct and transitive 
dependencies. We should, however, upgrade to the latest {{fasterxml}} jackson 
instead of using the ancient {{codehaus}} one.
The problem is that the {{codehaus}} jackson is part of the Avro public API 
(see AVRO-1605). This means we cannot drop the {{codehaus}} dependency 
until Avro removes jackson from its API.

> com.fasterxml.jackson.core.jackson dependency harmonization
> ------------------------------------------------------------
>
> Key: PARQUET-1173
> URL: https://issues.apache.org/jira/browse/PARQUET-1173
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Davide Gesino
>Priority: Minor
>
> Parquet as a whole depends on many jackson versions, including legacy releases.
> There are two overlapping dependency trees on *com.fasterxml.jackson.core* 
> bundles, with versions 2.7.1, 2.3.1 and 2.3.0:
> [INFO] +- org.apache.arrow:arrow-vector:jar:0.1.0:compile
> [INFO] |  +- joda-time:joda-time:jar:2.9:compile   
> [INFO] |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.7.1:compile 
> [INFO] |  +- com.fasterxml.jackson.core:jackson-databind:jar:2.7.1:compile
> [INFO] |  |  \- com.fasterxml.jackson.core:jackson-core:jar:2.7.1:compile
> and 
> [INFO] +- com.fasterxml.jackson.core:jackson-databind:jar:2.3.1:compile
> [INFO] |  +- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
> [INFO] |  \- com.fasterxml.jackson.core:jackson-core:jar:2.3.1:compile
> It would be better to have a single, non-overlapping jackson dependency tree.
> Other submodules of Parquet also depend on the old "codehaus" release. These 
> should be harmonized as well, at least those that do not need that version for 
> compatibility with third-party libraries that require the old one.
> I spotted this one:
> *Parquet jackson*
> [INFO] org.apache.parquet:parquet-jackson:jar:1.9.1-SNAPSHOT
> [INFO] +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
> [INFO] +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PARQUET-1236) Upgrade org.slf4j:slf4j-api:1.7.2 to 1.7.12

2018-03-22 Thread Gabor Szadovszky (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1236:
-----------------------------------------

Assignee: PandaMonkey

> Upgrade org.slf4j:slf4j-api:1.7.2 to 1.7.12
> -------------------------------------------
>
> Key: PARQUET-1236
> URL: https://issues.apache.org/jira/browse/PARQUET-1236
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: format-2.5.0
>Reporter: PandaMonkey
>Assignee: PandaMonkey
>Priority: Major
> Fix For: format-2.5.0
>
> Attachments: parquet-format.txt
>
>
> Hi, I found two versions of the library org.slf4j:slf4j-api in your project. It 
> would be nice to keep the versions consistent.
> They are introduced via the following paths:
>  # 
> org.apache.parquet:parquet-format:2.4.1-SNAPSHOT::null->org.slf4j:slf4j-api:1.7.2::compile
>  # 
> org.apache.parquet:parquet-format:2.4.1-SNAPSHOT::null->org.apache.thrift:libthrift:0.9.3::compile->org.slf4j:slf4j-api:1.7.12::compile
>  Thanks!
>  
> Regards,
>     Panda



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1251) Clarify ambiguous min/max stats for FLOAT/DOUBLE

2018-03-22 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409177#comment-16409177
 ] 

ASF GitHub Bot commented on PARQUET-1251:
------------------------------------------

gszadovszky commented on issue #88: PARQUET-1251: Clarify ambiguous min/max 
stats for FLOAT/DOUBLE
URL: https://github.com/apache/parquet-format/pull/88#issuecomment-375207117
 
 
   @zivanfi, could you please check this out?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Clarify ambiguous min/max stats for FLOAT/DOUBLE
> ------------------------------------------------
>
> Key: PARQUET-1251
> URL: https://issues.apache.org/jira/browse/PARQUET-1251
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: format-2.4.0
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: format-2.5.0
>
>
> Describe the handling of the ambiguous min/max statistics for FLOAT/DOUBLE 
> types in the case of TypeDefinedOrder (see PARQUET-1222 for details; a 
> reader-side sketch of these rules follows the list below):
> * When looking for NaN values, min and max should be ignored.
> * If the min is a NaN, it should be ignored.
> * If the max is a NaN, it should be ignored.
> * If the min is +0, the row group may contain -0 values as well.
> * If the max is -0, the row group may contain +0 values as well.
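Below is a small illustrative sketch, not part of the issue, of how a reader-side pruning check could honor these rules; the function and parameter names are invented for illustration:

{code:python}
# Illustrative only: names are invented, not part of the Parquet spec or any API.
import math

def row_group_may_contain(value, stat_min, stat_max):
    # When looking for NaN values, min and max carry no information.
    if math.isnan(value):
        return True
    # A NaN min or max must be ignored.
    lo = stat_min if stat_min is not None and not math.isnan(stat_min) else None
    hi = stat_max if stat_max is not None and not math.isnan(stat_max) else None
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    # The signed-zero rules hold automatically under these ordered comparisons,
    # since IEEE 754 treats -0.0 == +0.0: a +0 min already admits -0 values,
    # and a -0 max already admits +0 values.
    return True
{code}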



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)