Re: [I] 5000x performance degradation in RecordFromJSON vs NewJSONReader (yes really) [arrow-go]

via GitHub Wed, 23 Jul 2025 15:42:43 -0700


caldempsey commented on issue #448:
URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3110786681


   @zeroshade Thanks! The weird thing is there's also a bug with the chunking 
in relation to Spark Connect. 
   
   https://github.com/apache/spark-connect-go/issues/155
   
   The reason I feel this might be related is the number of chunks set is 
_always_ equal to the number of rows in the final DF from 
`CreateDataFrameFromArrow` (Spark Connect Go) _only_ when we consume from JSON. 
`RecordFromJSON` becomes basically unusable because it only ever returns one 
row. That can't be intended behaviour, so something is broken either on the 
Spark Connect Go side or on yours w.r.t. chunking.
   
   So, this might be a symptom of a deeper issue in how those chunks are being 
organised (incremental reading into different chunks might be broken as _only_ 
the first chunk seems to be read into the final DF). I feel like you're going 
chunk by chunk (1 row per chunk), but I don't have the expertise in the 
underlying arrow format to really say.
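
   To make the hypothesis above concrete, here is a minimal, stdlib-only Go 
sketch. It does not use `arrow-go` or Spark Connect Go; `row`, `readChunks`, 
and `totalRows` are hypothetical names invented for illustration. It only shows 
how a reader that emits one row per chunk, combined with a consumer that keeps 
just the first chunk, would end up with a one-row result.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// row is a hypothetical stand-in for one decoded record (not an Arrow type).
type row map[string]any

// readChunks decodes newline-delimited JSON and groups the rows into chunks
// of at most chunkSize rows. It models a chunked reader in general; it is an
// assumption for illustration, not how arrow-go is implemented.
func readChunks(input string, chunkSize int) ([][]row, error) {
	if chunkSize <= 0 {
		return nil, errors.New("chunkSize must be positive")
	}
	dec := json.NewDecoder(strings.NewReader(input))
	var chunks [][]row
	var cur []row
	for {
		var r row
		err := dec.Decode(&r)
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		cur = append(cur, r)
		if len(cur) == chunkSize {
			chunks = append(chunks, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur) // flush the final partial chunk
	}
	return chunks, nil
}

// totalRows counts rows across every chunk, which is what a correct
// consumer would see.
func totalRows(chunks [][]row) int {
	n := 0
	for _, c := range chunks {
		n += len(c)
	}
	return n
}

func main() {
	input := "{\"id\":1}\n{\"id\":2}\n{\"id\":3}"
	chunks, err := readChunks(input, 1) // one row per chunk
	if err != nil {
		panic(err)
	}
	// chunk count, rows in the first chunk, rows across all chunks
	fmt.Println(len(chunks), len(chunks[0]), totalRows(chunks)) // prints "3 1 3"
}
```

   With a chunk size of 1, the chunk count equals the row count, and a 
consumer that stops after `chunks[0]` sees a single row. That matches the 
symptom described, but it does not show which side is at fault.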
   
   If you feel this is unrelated to this bug, may I file a new issue 
here so someone more familiar with Arrow can help me work out whether this is 
an `arrow-go` issue or a `spark-connect-go` issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
