snakingfire opened a new issue, #44640:
URL: https://github.com/apache/arrow/issues/44640

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm asking this as a usage question rather than a bug report because it is more likely a usage issue than a library problem, but I don't know for sure. 
   
   I'm using Arrow indirectly, as a dependency of pyarrow, which in turn is used by the AWS Wrangler SDK (the library my application interfaces with directly). 
   
   I have run into an error when converting a pandas DataFrame to an Arrow table to be written to S3 as Parquet. The specific error comes from [builder_nested.cc](https://github.com/apache/arrow/blob/00e7c65e17f7d59f7c9954473b15b8ffae8dfd1a/cpp/src/arrow/array/builder_nested.cc#L103).
   
   When the error occurs, I see the following log printed, immediately followed by a SIGABRT and a process crash:
   ```
   /arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: 
(item_builder_->length()) == (key_builder_->length()) keys and items builders 
don't have the same size in MapBuilder
   Aborted (core dumped)
   ```
   
   Given this code is many layers of abstraction away from my application code, I am having a very hard time tracking down the source of the issue. 
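
   For context, the direct pyarrow equivalent of the write path looks roughly like this. It's a simplified sketch: the column name and the list-of-tuples representation of the map values are illustrative assumptions, and my real code goes through awswrangler rather than calling pyarrow itself:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # "attributes" is a hypothetical stand-in for my string<>string map
   # column, represented here as lists of (key, value) tuples.
   df = pd.DataFrame({
       "id": [1, 2, 3],
       "attributes": [
           [("k1", "v1"), ("k2", "v2")],
           [("k1", "v3")],
           None,
       ],
   })
   
   schema = pa.schema([
       pa.field("id", pa.int64()),
       pa.field("attributes", pa.map_(pa.string(), pa.string())),
   ])
   
   # The crash happens somewhere inside this conversion/write path.
   table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
   pq.write_table(table, "out.parquet")
   ```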
   
   What I know / have been able to track down so far:
   1. The issue has something to do with a specific string<>string map column, and is likely due to the values in that column: the error does not happen when the column is excluded or when its contents are replaced with dummy values.
   
   2. The issue requires a certain number of rows in the dataset. When I manually partition my input data into small chunks and serialize each partition individually, the error does not occur and every partition serializes successfully. The error only shows up when serializing a sufficiently large number of records (in my case, ~8M rows); see the bisection sketch below. 
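
   Since the failure only appears at scale and aborts the whole process, the narrowing I have been attempting amounts to something like the sketch below (hypothetical helper names). Each attempt runs in a subprocess so a SIGABRT only kills the child, and because the failure is size-dependent, the bisection may stop without isolating a single row, though it can still shrink the reproduction:
   
   ```python
   import io
   import multiprocessing as mp
   
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   
   def _try_serialize(df: pd.DataFrame) -> None:
       # Runs in a child process: a fatal check failure aborts the child
       # instead of the driver. (My real code also passes the schema with
       # the map type here.)
       table = pa.Table.from_pandas(df, preserve_index=False)
       pq.write_table(table, io.BytesIO())
   
   
   def crashes(df: pd.DataFrame) -> bool:
       proc = mp.Process(target=_try_serialize, args=(df,))
       proc.start()
       proc.join()
       return proc.exitcode != 0
   
   
   def narrow(df: pd.DataFrame) -> pd.DataFrame:
       # Keep whichever half still reproduces the crash; stop if neither
       # half does on its own (the failure seems to need enough rows).
       while len(df) > 1:
           mid = len(df) // 2
           head, tail = df.iloc[:mid], df.iloc[mid:]
           if crashes(head):
               df = head
           elif crashes(tail):
               df = tail
           else:
               break
       return df
   ```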
   
   
   I have been working on a minimal reproduction, but it has been slow going. In the meantime, I would like to ask for help with steps I could take to narrow down the cause. As a first step, it would be helpful to understand what scenario this protection check is intended to guard against, so I can tell whether anything I am doing with the data I am trying to serialize is likely to trip it. 
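
   For what it's worth, the invariant itself looks straightforward: every map slot must contribute exactly as many keys as items. At the Python level, the analogous constructor appears to validate this up front rather than abort, if I'm reading it right (a sketch, not taken from my failing path):
   
   ```python
   import pyarrow as pa
   
   offsets = pa.array([0, 2, 3], type=pa.int32())
   keys = pa.array(["a", "b", "c"], type=pa.string())
   items = pa.array(["x", "y"], type=pa.string())  # one item short of the keys
   
   # Expected to raise ArrowInvalid because the key and item child arrays
   # have different lengths -- the same condition the MapBuilder check
   # asserts internally.
   pa.MapArray.from_arrays(offsets, keys, items)
   ```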
   
   Unfortunately, hitting this check condition seems to be a very uncommon occurrence, so there are next to no existing reports or discussions online about the conditions under which it fails. 
   
   Any help is much appreciated. 
   
   ### Component(s)
   
   C++, Parquet, Python

