qiushido commented on issue #622: URL: https://github.com/apache/arrow-go/issues/622#issuecomment-3814975848
> [@qiushido](https://github.com/qiushido) do you have a reproducer?

I am not entirely sure I understand the specific point you are raising. My knowledge of Apache Arrow isn't deeply specialized, but I can describe my use case for more context.

Since Parquet requires a predefined schema for writing, I use a workaround in my data export tasks to avoid manually defining a schema for every single job. I consume messages from Kafka, parse them into `map[string]any`, and use reflection to map values to a few fixed Arrow-supported types:

- int64
- float64
- string -> largeString
- []byte -> largeBinary
- Other types outside this range -> json.Marshal -> largeString

I then use Arrow to write this data to Parquet.

During this process, I’ve noticed that if a message is very large (tens or hundreds of MBs), which I assume means a specific field value is extremely long, writing multiple rows into a single Parquet file triggers this panic. To work around this, I currently have to reduce the number of rows per Parquet file (e.g., from 1,000 rows down to 100 rows per file).

I consulted Gemini, and it suggested adding the configuration `parquet.WithDictionaryDefault(false)` to disable dictionary encoding and prevent index overflow with large datasets. However, this doesn't seem to have resolved the issue.

Additionally, I encountered the error `panic: bytes.Buffer.Grow: negative count`. This happened in a similar data export task, and I also resolved it by reducing the number of rows per Parquet file.

Currently, with the row count limited to 100 per file and ZSTD (level 1) compression enabled, the resulting Parquet files are generally under 500MB. I am not sure what the best practices are for Parquet write configurations; if there are recommended settings or parameters to handle this more gracefully, please let me know.

Thank you very much!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
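
For readers following the thread, the type mapping described in the comment could be sketched roughly as follows. This is only an illustration using the Go standard library: it uses a type switch rather than the `reflect` package the comment mentions, and the names `classify`, `columnKind`, and the `kind...` labels are hypothetical, not the commenter's actual code and not part of the arrow-go API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// columnKind is a hypothetical label for the Arrow type a value is mapped to.
type columnKind string

const (
	kindInt64       columnKind = "int64"
	kindFloat64     columnKind = "float64"
	kindLargeString columnKind = "largeString"
	kindLargeBinary columnKind = "largeBinary"
)

// classify returns the target kind for a decoded value, plus the value in
// the form that would be appended to a column of that kind. Anything outside
// the four fixed types is serialized with json.Marshal and stored as a string.
func classify(v any) (columnKind, any, error) {
	switch x := v.(type) {
	case int64:
		return kindInt64, x, nil
	case float64:
		return kindFloat64, x, nil
	case string:
		return kindLargeString, x, nil
	case []byte:
		return kindLargeBinary, x, nil
	default:
		b, err := json.Marshal(x)
		if err != nil {
			return "", nil, fmt.Errorf("marshal %T: %w", v, err)
		}
		return kindLargeString, string(b), nil
	}
}

func main() {
	// Sample values standing in for fields of a parsed Kafka message.
	values := []any{int64(7), 1.5, "hello", []byte("raw"), []any{"a", "b"}}
	for _, v := range values {
		kind, out, err := classify(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%T -> %s (%v)\n", v, kind, out)
	}
}
```

Note that a plain Go `int` (as opposed to `int64`) falls through to the JSON branch in this sketch; a real implementation would have to decide how to widen other numeric types.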
