[ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432768#comment-16432768 ]

ASF GitHub Bot commented on ARROW-1938:
---------------------------------------

joshuastorck opened a new pull request #453: Bug fix for ARROW-1938
URL: https://github.com/apache/parquet-cpp/pull/453
 
 
   The error was reported here: 
https://issues.apache.org/jira/browse/ARROW-1938.
   
   Because dictionary types are not yet supported for writing, the code first 
converts the dictionary column to its actual values before writing. However, the 
existing code accidentally used zero as the offset and the full column length as 
the size, so every chunk that was supposed to be written re-emitted all of the 
column's values.
   
   The fix is to pass the actual offset and size through when recursively 
calling WriteColumnChunk with the "flattened" data. 
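   The faulty and corrected chunking logic can be sketched in Python (a 
hypothetical model of the parquet-cpp code path, not the actual C++): a writer 
that splits a column into row groups must slice each chunk at its offset rather 
than re-emitting the whole column.

```python
def write_chunks_buggy(column, row_group_size):
    """Model of the pre-fix behavior: the recursive call ignores the
    chunk offset and uses the full column length, so every row group
    re-emits the entire column."""
    chunks = []
    for offset in range(0, len(column), row_group_size):
        # Bug: offset hard-coded to 0, size hard-coded to len(column)
        chunks.append(column[0:len(column)])
    return chunks


def write_chunks_fixed(column, row_group_size):
    """Model of the fix: pass the real offset and size through to the
    recursive call on the flattened data."""
    chunks = []
    for offset in range(0, len(column), row_group_size):
        size = min(row_group_size, len(column) - offset)
        chunks.append(column[offset:offset + size])
    return chunks
```

   With a 25-row column and a row group size of 10, the buggy version emits 
three chunks of 25 rows each (mirroring the size mismatch in the reported 
error), while the fixed version emits chunks of 10, 10, and 5 rows.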

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Error writing to partitioned Parquet dataset
> -----------------------------------------------------
>
>                 Key: ARROW-1938
>                 URL: https://issues.apache.org/jira/browse/ARROW-1938
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Linux (Ubuntu 16.04)
>            Reporter: Robert Dailey
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>         Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, 
> pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing 
> to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 10000
> The command was:
>     write_table_values = {'row_group_size': 10000}
>     pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True),
>                         '/logs/parsed/test',
>                         partition_cols=['Product', 'year', 'month', 'day', 'hour'],
>                         **write_table_values)
> I've also tried write_table_values = {'chunk_size': 10000} and received the 
> same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the 
> problem myself but wanted to submit a ticket in the meantime.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
