[jira] [Assigned] (PARQUET-323) INT96 should be marked as deprecated

2018-03-12 Thread Lars Volker (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker reassigned PARQUET-323:
---

Assignee: Lars Volker

> INT96 should be marked as deprecated
> ------------------------------------
>
> Key: PARQUET-323
> URL: https://issues.apache.org/jira/browse/PARQUET-323
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Lars Volker
>Priority: Major
>
> As discussed in the mailing list, {{INT96}} is only used to represent nanosecond 
> timestamps in Impala for historical reasons, and should be deprecated. 
> Since nanosec precision is rarely a real requirement, one possible and simple 
> solution would be replacing {{INT96}} with {{INT64 (TIMESTAMP_MILLIS)}} or 
> {{INT64 (TIMESTAMP_MICROS)}}.
> Several projects (Impala, Hive, Spark, ...) support INT96.
> We need a clear spec of the replacement and the path to deprecation.
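As a sketch of what the replacement means at the byte level: INT96 stores an 8-byte nanoseconds-of-day count followed by a 4-byte Julian day number (the layout Impala writes), while {{INT64 (TIMESTAMP_MICROS)}} is a plain microsecond count since the Unix epoch. The converter below is illustrative only, not part of any spec:

```python
import struct

JULIAN_UNIX_EPOCH = 2440588      # Julian day number of 1970-01-01
NANOS_PER_DAY = 86400 * 10**9

def int96_to_micros(raw: bytes) -> int:
    """Convert a 12-byte INT96 timestamp to INT64 microseconds since epoch.

    Assumes the Impala layout: nanoseconds within the day (8 bytes,
    little-endian) followed by the Julian day number (4 bytes).
    """
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    total_nanos = (julian_day - JULIAN_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day
    return total_nanos // 1000   # TIMESTAMP_MICROS drops sub-microsecond precision

# Midnight 1970-01-02 is one day after the epoch:
print(int96_to_micros(struct.pack("<qi", 0, JULIAN_UNIX_EPOCH + 1)))  # → 86400000000
```

The integer division is where the precision loss happens, which is why the JIRA notes that nanosecond precision is rarely a real requirement.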



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Date for next Parquet sync

2018-03-12 Thread Lars Volker
I sent out a meeting request for tomorrow, Tuesday, 10am PDT, 6pm CET, 5pm
UTC. If you want to join and have not received an invite, please reach out
to me.

Cheers, Lars

On Thu, Mar 8, 2018 at 4:22 PM, Julien Le Dem 
wrote:

> Actually because of Daylight saving time we will have one less hour next
> week.
> https://www.timeanddate.com/worldclock/meetingdetails.html?year=2018&month=3&day=13&hour=17&min=0&sec=0&p1=224&p2=50&p3=195
> Location                          Local Time                                  UTC Offset
> San Francisco (USA - California)  Tuesday, March 13, 2018 at 10:00:00 am PDT  UTC-7 hours
> Budapest (Hungary)                Tuesday, March 13, 2018 at 6:00:00 pm CET   UTC+1 hour
> Paris (France - Île-de-France)    Tuesday, March 13, 2018 at 6:00:00 pm CET   UTC+1 hour
> Corresponding UTC (GMT)           Tuesday, March 13, 2018 at 17:00:00
>
>
> On Thu, Mar 8, 2018 at 4:12 PM, Julien Le Dem 
> wrote:
>
> > or 10am PST, but that's a little late for the team in Budapest.
> >
> > On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem 
> > wrote:
> >
> >> I'm sorry, it turns out I now have a conflict at this particular time.
> >> Maybe Wednesday?
> >>
> >> On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker  wrote:
> >>
> >>> Hi All,
> >>>
> >>> It has been almost 3 weeks since the last sync and there are a bunch of
> >>> ongoing discussions on the mailing list. Let's find a date for the next
> >>> Parquet community sync. Last time we met on a Wednesday, so this time it
> >>> should be Tuesday.
> >>>
> >>> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
> >>> allows us to get back to the biweekly cadence without overlapping with the
> >>> Arrow sync, which happens this week.
> >>>
> >>> Please speak up if that time does not work for you.
> >>>
> >>> Cheers, Lars
> >>>
> >>
> >>
> >
>
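The daylight-saving squeeze discussed in the thread above is easy to double-check in code: the US had already switched to DST (March 11, 2018) while Europe had not (March 25), so 17:00 UTC lands an hour "earlier" in Europe than usual. A small script (the IANA zone names are plausible stand-ins for the cities quoted, not taken from the thread):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# The proposed sync slot: 2018-03-13 17:00 UTC.
utc = datetime(2018, 3, 13, 17, 0, tzinfo=timezone.utc)
for tz in ("America/Los_Angeles", "Europe/Budapest", "Europe/Paris"):
    local = utc.astimezone(ZoneInfo(tz))
    print(f"{tz}: {local:%H:%M %Z}")
# America/Los_Angeles: 10:00 PDT
# Europe/Budapest: 18:00 CET
# Europe/Paris: 18:00 CET
```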


[jira] [Commented] (PARQUET-1244) Documentation link to logical types broken

2018-03-12 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395754#comment-16395754
 ] 

Antoine Pitrou commented on PARQUET-1244:
-

Actually it seems the page I linked to is an outdated version of 
[https://github.com/apache/parquet-format/blob/master/README.md] 

> Documentation link to logical types broken
> --
>
> Key: PARQUET-1244
> URL: https://issues.apache.org/jira/browse/PARQUET-1244
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Minor
>
> The link to {{LogicalTypes.md}} here is broken:
> https://parquet.apache.org/documentation/latest/





[jira] [Assigned] (PARQUET-1209) locally defined symbol ... imported in function ..

2018-03-12 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1209:


Assignee: rip.nsk

> locally defined symbol ... imported in function ..
> --
>
> Key: PARQUET-1209
> URL: https://issues.apache.org/jira/browse/PARQUET-1209
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: rip.nsk
>Assignee: rip.nsk
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Got the following linker warning LNK4217:
> locally defined symbol ??1Status@arrow@@QEAA@XZ (public: __cdecl 
> arrow::Status::~Status(void)) imported in function "private: void __cdecl 
> parquet::TypedRowGroupStatistics >::Copy(struct 
> parquet::ByteArray const &,struct parquet::ByteArray *,class 
> arrow::PoolBuffer *)" 
> (?Copy@?$TypedRowGroupStatistics@U?$DataType@$05@parquet@@@parquet@@AEAAXAEBUByteArray@2@PEAU32@PEAVPoolBuffer@arrow@@@Z)
> Not sure whether this is a parquet or an arrow issue.
> https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/linker-tools-warning-lnk4217





[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395748#comment-16395748
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

pitrou commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173916004
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   Hmm... I didn't know this guideline (but out parameters are passed as 
pointers apparently?).
   The search is done using pointer equality, so I thought passing a pointer 
would be more explicit (and perhaps less error-prone, in case the compiler does 
a temporary copy).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.
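Until the fix lands, a cheap way to fail fast is to check for duplicate column names before converting. The helper below is an illustrative workaround, not part of PyArrow's API; `table.schema.names` is the list of column names a pyarrow Table exposes:

```python
from collections import Counter

def duplicate_columns(names):
    """Return column names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]

# Hypothetical usage against a pyarrow Table:
#   dupes = duplicate_columns(table.schema.names)
#   if dupes:
#       raise ValueError(f"refusing to convert, duplicate columns: {dupes}")
print(duplicate_columns(["a", "__index_level_0__", "b", "__index_level_0__"]))
# → ['__index_level_0__']
```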





[jira] [Commented] (PARQUET-1209) locally defined symbol ... imported in function ..

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395750#comment-16395750
 ] 

ASF GitHub Bot commented on PARQUET-1209:
-

xhochy closed pull request #446: PARQUET-1209: define ARROW_STATIC when 
PARQUET_ARROW_LINKAGE is static
URL: https://github.com/apache/parquet-cpp/pull/446
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, it is displayed below for the sake of provenance:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index bca8478c..8a8da0fc 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -573,7 +573,7 @@ else()
     zstd
   )
 
-  add_definitions(-DARROW_EXPORTING)
+  add_definitions(-DARROW_STATIC)
 
   set(ARROW_LINK_LIBS
     arrow_static
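The one-line change matters because Arrow's headers annotate symbols with `__declspec(dllimport)` on Windows unless `ARROW_STATIC` is defined; linking `arrow_static` without it is what produces the LNK4217 warnings tracked by this issue. A hedged sketch of the surrounding logic (the option name `PARQUET_ARROW_LINKAGE` comes from the PR title; the rest is illustrative, not the actual parquet-cpp CMakeLists.txt):

```cmake
# Illustrative sketch: when linking Arrow statically, consumers must
# define ARROW_STATIC so Arrow's headers stop marking symbols as
# __declspec(dllimport) on Windows.
if("${PARQUET_ARROW_LINKAGE}" STREQUAL "static")
  add_definitions(-DARROW_STATIC)
endif()
```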


 




> locally defined symbol ... imported in function ..
> --
>
> Key: PARQUET-1209
> URL: https://issues.apache.org/jira/browse/PARQUET-1209
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: rip.nsk
>Priority: Major
>
> Got the following linker warning LNK4217:
> locally defined symbol ??1Status@arrow@@QEAA@XZ (public: __cdecl 
> arrow::Status::~Status(void)) imported in function "private: void __cdecl 
> parquet::TypedRowGroupStatistics >::Copy(struct 
> parquet::ByteArray const &,struct parquet::ByteArray *,class 
> arrow::PoolBuffer *)" 
> (?Copy@?$TypedRowGroupStatistics@U?$DataType@$05@parquet@@@parquet@@AEAAXAEBUByteArray@2@PEAU32@PEAVPoolBuffer@arrow@@@Z)
> Not sure whether this is a parquet or an arrow issue.
> https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/linker-tools-warning-lnk4217





[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395742#comment-16395742
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

xhochy commented on a change in pull request #447: PARQUET-1245: Fix creating 
Arrow table with duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#discussion_r173914725
 
 

 ##
 File path: src/parquet/schema.cc
 ##
 @@ -720,17 +718,15 @@ int SchemaDescriptor::ColumnIndex(const std::string& node_path) const {
   return search->second;
 }
 
-int SchemaDescriptor::ColumnIndex(const Node& node) const {
-  int result = ColumnIndex(node.path()->ToDotString());
-  if (result < 0) {
-    return -1;
-  }
-  DCHECK(result < num_columns());
-  if (!node.Equals(Column(result)->schema_node().get())) {
-    // Same path but not the same node
-    return -1;
+int SchemaDescriptor::ColumnIndex(const Node* node) const {
 
 Review comment:
   Why did this change from reference to pointer? We use references everywhere 
where the passed object cannot be null.




> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.





[jira] [Commented] (PARQUET-1244) Documentation link to logical types broken

2018-03-12 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395739#comment-16395739
 ] 

Antoine Pitrou commented on PARQUET-1244:
-

Ditto for the link to {{Encodings.md}}.

> Documentation link to logical types broken
> --
>
> Key: PARQUET-1244
> URL: https://issues.apache.org/jira/browse/PARQUET-1244
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Antoine Pitrou
>Priority: Minor
>
> The link to {{LogicalTypes.md}} here is broken:
> https://parquet.apache.org/documentation/latest/





[jira] [Commented] (PARQUET-1241) Use LZ4 frame format

2018-03-12 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395711#comment-16395711
 ] 

Ryan Blue commented on PARQUET-1241:


Does anyone know what the Hadoop compression codec produces? That's what we're 
using in the Java implementation, so that's what the current LZ4 codec name 
indicates. I didn't realize there were multiple formats.

> Use LZ4 frame format
> 
>
> Key: PARQUET-1241
> URL: https://issues.apache.org/jira/browse/PARQUET-1241
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp, parquet-format
>Reporter: Lawrence Chan
>Priority: Major
>
> The parquet-format spec doesn't currently specify whether lz4-compressed data 
> should be framed or not. We should choose one and make it explicit in the 
> spec, as they are not interoperable. After some discussions with others [1], 
> we think it would be beneficial to use the framed format, which adds a small 
> header in exchange for more self-contained decompression as well as a richer 
> feature set (checksums, parallel decompression, etc).
> The current arrow implementation compresses using the lz4 block format, and 
> this would need to be updated when we add the spec clarification.
> If backwards compatibility is a concern, I would suggest adding an additional 
> LZ4_FRAMED compression type, but that may be more noise than anything.
> [1] https://github.com/dask/fastparquet/issues/314
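The two formats are easy to tell apart at the byte level: the LZ4 frame format begins with the magic number 0x184D2204 (stored little-endian), while the raw block format has no header at all. A small sniffing helper (illustrative, not taken from any Parquet implementation):

```python
# LZ4 frame magic number 0x184D2204, little-endian on the wire.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"

def looks_like_lz4_frame(buf: bytes) -> bool:
    """Heuristic: does this compressed buffer start with an LZ4 frame header?"""
    return buf[:4] == LZ4_FRAME_MAGIC

print(looks_like_lz4_frame(b"\x04\x22\x4d\x18" + b"\x00" * 8))  # → True
print(looks_like_lz4_frame(b"\x10plain block data"))            # → False
```

Note the reverse direction is harder: a raw block can start with any byte, so a reader cannot reliably detect block-format data, which is why the spec needs to pick one format explicitly.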





[jira] [Commented] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395701#comment-16395701
 ] 

ASF GitHub Bot commented on PARQUET-1245:
-

wesm commented on issue #447: PARQUET-1245: Fix creating Arrow table with 
duplicate column names
URL: https://github.com/apache/parquet-cpp/pull/447#issuecomment-372422163
 
 
   Moved the JIRA from Arrow to Parquet




> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.





[jira] [Assigned] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1245:
-

Assignee: Antoine Pitrou

> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.





[jira] [Updated] (PARQUET-1245) [C++] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated PARQUET-1245:
--
Summary: [C++] Segfault when writing Arrow table with duplicate columns  
(was: [Python] Segfault when writing Arrow table with duplicate columns)

> [C++] Segfault when writing Arrow table with duplicate columns
> --
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.





[jira] [Assigned] (PARQUET-1245) [Python] Segfault when writing Arrow table with duplicate columns

2018-03-12 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned PARQUET-1245:
-

 Assignee: (was: Antoine Pitrou)
Fix Version/s: (was: 0.9.0)
   cpp-1.5.0
Affects Version/s: (was: 0.8.0)
  Component/s: (was: Python)
   (was: C++)
 Workflow: patch-available, re-open possible  (was: jira)
  Key: PARQUET-1245  (was: ARROW-1974)
  Project: Parquet  (was: Apache Arrow)

> [Python] Segfault when writing Arrow table with duplicate columns
> -
>
> Key: PARQUET-1245
> URL: https://issues.apache.org/jira/browse/PARQUET-1245
> Project: Parquet
>  Issue Type: Bug
> Environment: Linux Mint 18.2
> Anaconda Python distribution + pyarrow installed from the conda-forge channel
>Reporter: Alexey Strokach
>Priority: Minor
>  Labels: pull-request-available
> Fix For: cpp-1.5.0
>
>
> I accidentally created a large number of Parquet files with two 
> __index_level_0__ columns (through a Spark SQL query).
> PyArrow can read these files into tables, but it segfaults when converting 
> the resulting tables to Pandas DataFrames or when saving the tables to 
> Parquet files.
> {code:none}
> # Duplicate columns cause segmentation faults
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.to_pandas()  # Segmentation fault
> pq.write_table(table, '/some/output.parquet') # Segmentation fault
> {code}
> If I remove the duplicate column using table.remove_column(...) everything 
> works without segfaults.
> {code:none}
> # After removing duplicate columns, everything works fine
> table = pq.read_table('/path/to/duplicate_column_file.parquet')
> table.remove_column(34)
> table.to_pandas()  # OK
> pq.write_table(table, '/some/output.parquet')  # OK
> {code}
> For more concrete examples, see `test_segfault_1.py` and `test_segfault_2.py` 
> here: https://gitlab.com/ostrokach/pyarrow_duplicate_column_errors.





[jira] [Commented] (PARQUET-1135) upgrade thrift and protobuf dependencies

2018-03-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395682#comment-16395682
 ] 

ASF GitHub Bot commented on PARQUET-1135:
-

rdblue commented on issue #427: PARQUET-1135: upgrade thrift and protobuf 
dependencies
URL: https://github.com/apache/parquet-mr/pull/427#issuecomment-372419146
 
 
   Is this binary compatible with thrift 0.7.0?




> upgrade thrift and protobuf dependencies
> 
>
> Key: PARQUET-1135
> URL: https://issues.apache.org/jira/browse/PARQUET-1135
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Julien Le Dem
>Assignee: Julien Le Dem
>Priority: Major
> Fix For: 1.9.1
>
>
> thrift 0.7.0 -> 0.9.3
>  protobuf 3.2 -> 3.5.1





[jira] [Created] (PARQUET-1244) Documentation link to logical types broken

2018-03-12 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created PARQUET-1244:
---

 Summary: Documentation link to logical types broken
 Key: PARQUET-1244
 URL: https://issues.apache.org/jira/browse/PARQUET-1244
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Antoine Pitrou


The link to {{LogicalTypes.md}} here is broken:
https://parquet.apache.org/documentation/latest/



