caldempsey commented on issue #448:
URL: https://github.com/apache/arrow-go/issues/448#issuecomment-3111522827

   @zeroshade Hey, I appreciate the fast responses. The `TableFromJSON` 
function actually returns one row out of 10K in my test when the resulting 
table is passed to a DataFrame, and I've updated **_that_** GitHub issue with a 
complete standalone test you can use to repro the DataFrame problem, ready for 
you to plug a Spark Connect URL into if you fancy testing it. This is with:
   
   ```go
   table, err := array.TableFromJSON(memory.DefaultAllocator, schema,
       jsonData, // one payload per line
       // Without WithMultipleDocs, this fails with:
       // "failed to create Arrow table from JSON: json doc must be an array, found {"
       array.WithMultipleDocs(),
   )
   ```
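   
   For completeness, here's a minimal, self-contained sketch of that call shape. The `id` schema and the generated 10K payloads are placeholders standing in for my real data; it at least lets you sanity-check `table.NumRows()` before the table ever reaches a DataFrame:
   
   ```go
   package main
   
   import (
       "fmt"
       "log"
   
       "github.com/apache/arrow-go/v18/arrow"
       "github.com/apache/arrow-go/v18/arrow/array"
       "github.com/apache/arrow-go/v18/arrow/memory"
   )
   
   func main() {
       // Placeholder schema standing in for my real one.
       schema := arrow.NewSchema([]arrow.Field{
           {Name: "id", Type: arrow.PrimitiveTypes.Int64},
       }, nil)
   
       // 10K single-object JSON docs, one payload per slice element.
       jsonData := make([]string, 10_000)
       for i := range jsonData {
           jsonData[i] = fmt.Sprintf(`{"id": %d}`, i)
       }
   
       table, err := array.TableFromJSON(memory.DefaultAllocator, schema,
           jsonData,
           array.WithMultipleDocs(),
       )
       if err != nil {
           log.Fatal(err)
       }
       defer table.Release()
   
       // Sanity check before handing the table to a DataFrame.
       fmt.Println("rows:", table.NumRows())
   }
   ```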
   
   I _then_ looked at the other method, saw the same issue, and eventually 
worked out that whatever you set `WithChunk()` to is always equal to the total 
number of rows that get parsed, unless it's `-1`. So I trust this reproduces 
that problem:
   ```go
   array.NewJSONReader(bytes.NewReader(ndjsonData), schema,
                array.WithAllocator(pool),
                array.WithChunk(100))
   ```
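   
   Concretely, a counting loop like the sketch below (again with a placeholder `id` schema and 10K generated NDJSON lines) should, per the `WithChunk` docs, produce 100 record batches of 100 rows each; what I observe is that only the first chunk's rows come back:
   
   ```go
   package main
   
   import (
       "bytes"
       "fmt"
       "log"
   
       "github.com/apache/arrow-go/v18/arrow"
       "github.com/apache/arrow-go/v18/arrow/array"
       "github.com/apache/arrow-go/v18/arrow/memory"
   )
   
   func main() {
       // Placeholder schema standing in for my real one.
       schema := arrow.NewSchema([]arrow.Field{
           {Name: "id", Type: arrow.PrimitiveTypes.Int64},
       }, nil)
   
       // 10K NDJSON lines.
       var buf bytes.Buffer
       for i := 0; i < 10_000; i++ {
           fmt.Fprintf(&buf, "{\"id\": %d}\n", i)
       }
   
       pool := memory.NewGoAllocator()
       rdr := array.NewJSONReader(bytes.NewReader(buf.Bytes()), schema,
           array.WithAllocator(pool),
           array.WithChunk(100),
       )
       defer rdr.Release()
   
       var total int64
       for rdr.Next() {
           total += rdr.Record().NumRows()
       }
       if err := rdr.Err(); err != nil {
           log.Fatal(err)
       }
   
       // Expected: 10000 (100 chunks of 100 rows).
       // Observed in my testing: only WithChunk's value, i.e. 100.
       fmt.Println("total rows:", total)
   }
   ```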
   
   Apologies, the initial helper functions I provided were a bit scrappy. The 
`[]string{}` solution didn't work for me because of the performance issue above 
with real data (all my operations timed out, so it was unusable). I then tried 
passing one record to each member of the slice, which yields only the first 
chunk.
   
   Not being able to use `[]string{}` in my production use cases is how I went 
down this performance rabbit hole in the first place, and eventually ended up 
here!
   

