qiushido commented on issue #622:
URL: https://github.com/apache/arrow-go/issues/622#issuecomment-3814975848

   > [@qiushido](https://github.com/qiushido) do you have a reproducer?
   
   I am not entirely sure I understand the specific point you are raising. My 
knowledge of Apache Arrow isn't deeply specialized, but I can describe my use 
case for more context.
   Since Parquet requires a predefined schema for writing, I used a workaround 
for my data export tasks to avoid manually defining a schema for every single 
job. I consume messages from Kafka, parse them into map[string]any, and use 
reflection to map values to a few fixed Arrow-supported types:
   - int64
   - float64
   - string -> largeString
   - []byte -> largeBinary
   - Any other type -> json.Marshal -> largeString
   
   I then use Arrow to write this data to Parquet. During this process, I’ve 
noticed that when a message is very large (tens or hundreds of MB), which I 
assume means a single field value is extremely long, writing multiple rows 
into a single Parquet file triggers this panic.
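
   For illustration, the type mapping listed above could be sketched roughly as follows. This is not the actual export code: it uses a plain type switch instead of reflection, `mapValue` is a hypothetical helper name, and the target Arrow type is returned as a string label so the sketch needs only the standard library.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mapValue is a hypothetical helper: it returns a label for the target
// Arrow type and the value to store in that column.
func mapValue(v any) (string, any, error) {
	switch x := v.(type) {
	case int64:
		return "int64", x, nil
	case float64:
		return "float64", x, nil
	case string:
		return "largeString", x, nil
	case []byte:
		return "largeBinary", x, nil
	default:
		// Everything else is serialized to JSON and stored as a string.
		b, err := json.Marshal(x)
		if err != nil {
			return "", nil, err
		}
		return "largeString", string(b), nil
	}
}

func main() {
	for _, v := range []any{int64(7), 1.5, "abc", []byte{1, 2}, []string{"x", "y"}} {
		kind, out, err := mapValue(v)
		fmt.Println(kind, out, err)
	}
}
```

   With this mapping, a single oversized nested value ends up as one very long largeString cell, which matches the assumption above about where the large rows come from.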
   To circumvent this, I currently have to reduce the number of rows per 
Parquet file (e.g., from 1,000 rows down to 100 rows per file). 
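
   A possible variation on the fixed row cap, sketched here only as an idea and not something used in the setup described: cut a new file when either a row limit or a byte budget would be exceeded, so a few huge messages close the file early while small messages can still be batched by the thousand. The `batcher` type and its limits are hypothetical names for this sketch.

```go
package main

import "fmt"

// batcher groups rows so that a batch holds at most maxRows rows and at
// most maxBytes payload bytes. A single row larger than maxBytes still
// gets a batch of its own.
type batcher struct {
	maxRows  int
	maxBytes int64
	rows     [][]byte
	size     int64
}

// add appends row. If the row would push the current batch over a limit,
// the current batch is returned first so the caller can write it out.
func (b *batcher) add(row []byte) [][]byte {
	var full [][]byte
	over := len(b.rows) >= b.maxRows || b.size+int64(len(row)) > b.maxBytes
	if len(b.rows) > 0 && over {
		full = b.flush()
	}
	b.rows = append(b.rows, row)
	b.size += int64(len(row))
	return full
}

// flush returns the pending rows and resets the batcher.
func (b *batcher) flush() [][]byte {
	out := b.rows
	b.rows, b.size = nil, 0
	return out
}

func main() {
	b := &batcher{maxRows: 1000, maxBytes: 100}
	for _, n := range []int{40, 40, 40, 200, 10} {
		if full := b.add(make([]byte, n)); full != nil {
			fmt.Println("write file with", len(full), "rows")
		}
	}
	fmt.Println("write final file with", len(b.flush()), "rows")
}
```

   Here each returned batch would become one Parquet file. The byte budget is counted on the raw message size, which is only a rough proxy for what the writer buffers internally.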
   
   I consulted Gemini, and it suggested adding the configuration 
`parquet.WithDictionaryDefault(false)` to disable dictionary encoding and 
prevent index overflow with large datasets. However, this doesn't seem to have 
resolved the issue.
   
   Additionally, I encountered the error `panic: bytes.Buffer.Grow: negative 
count`. This happened in a similar data export task, and I also worked around 
it by reducing the number of rows per Parquet file. Currently, with the row 
count limited to 100 per file and ZSTD (level 1) compression enabled, the 
resulting Parquet files are generally under 500MB.
   
   I am not sure what the best practices are for Parquet write configurations; 
if there are recommended settings or parameters to handle this more gracefully, 
please let me know. Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
