n3world commented on pull request #10202:
URL: https://github.com/apache/arrow/pull/10202#issuecomment-872631327


   > I'm not sure there is an obvious way to solve this problem in parallel. 
The parser will start parsing block X before block X-1 has finished parsing. 
The parser input (CSVBlock) doesn't know how many lines it has. That is not 
discovered until parsing time. So, for example, the parser for block 2 might 
realize there is an error on the third line of block 2. But, without knowing 
how many lines are in block 1 (and block 1 may not have finished parsing) it is 
hard to say what the line number of that error is.
   
   Sorry, I don't think I was clear in my comment. I was able to report the 
byte offset of the row on which a parser error occurred, but ran into an issue 
getting that offset during conversion. The goal was to give some location 
feedback even for parallel parsing, where the row number is difficult to 
determine, as you mentioned.
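
To illustrate the byte-offset idea, here is a minimal sketch in Python rather than the actual Arrow C++ code. `Block`, `ParseError`, and `parse_block` are hypothetical names, and the sketch assumes each block knows its absolute starting offset in the input and that fields contain no quoted delimiters or newlines.

```python
from dataclasses import dataclass


@dataclass
class Block:
    start_offset: int  # absolute byte offset of this block in the input (assumed known)
    data: bytes        # block contents, assumed to end on a row boundary


class ParseError(Exception):
    def __init__(self, byte_offset, message):
        super().__init__(f"{message} (row starts at byte offset {byte_offset})")
        self.byte_offset = byte_offset


def parse_block(block, num_cols):
    """Parse one block independently of all other blocks."""
    rows = []
    pos = 0  # offset of the current row relative to the start of the block
    for line in block.data.splitlines(keepends=True):
        fields = line.rstrip(b"\r\n").split(b",")
        if len(fields) != num_cols:
            # No row number is needed: block offset + local offset locates the row.
            raise ParseError(
                block.start_offset + pos,
                f"expected {num_cols} columns, got {len(fields)}",
            )
        rows.append(fields)
        pos += len(line)
    return rows
```

Because the offset depends only on the block itself, this works no matter which order the blocks finish parsing in.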
   
   > You could do a serial pass prior to parsing that just figures out how many 
lines are in a block but I suspect that would be too much overhead.
   
   I agree this wouldn't be worth it.
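
For reference, the serial pre-pass would look roughly like the sketch below. `first_line_numbers` is a hypothetical helper, and it assumes blocks end on row boundaries and contain no quoted newlines. It has to scan every byte of the input before parsing can use the result, which is where the overhead comes from.

```python
def first_line_numbers(blocks):
    """Return the 1-based line number at which each block starts."""
    starts = []
    next_line = 1
    for block in blocks:
        starts.append(next_line)
        # Serial scan: every block is read once just to count newlines.
        next_line += block.count(b"\n")
    return starts
```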
   
   > You could delay reporting an error on block X until blocks 1 to (X-1) have 
finished parsing (so you can know what the line number is). That would probably 
be the solution I would take if I needed to do this. However, I don't know 
off-hand how to do that delay in a low-complexity way.
   
   That could work, but I'm not sure it is worth the complexity.
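
As a rough sketch of the delayed-reporting idea (in Python, not the Arrow implementation): each block is parsed in parallel and returns its row count plus a block-local error position, and errors are only turned into line numbers while walking the results in block order. `parse_block` and `first_error_line` are hypothetical names, and quoting is ignored.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_block(data, num_cols):
    """Return (rows_in_block, index of first bad row in the block or None)."""
    lines = data.splitlines()
    bad = None
    for i, line in enumerate(lines):
        if len(line.split(b",")) != num_cols:
            bad = i
            break
    return len(lines), bad


def first_error_line(blocks, num_cols):
    """Return the 1-based line number of the first bad row, or None."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(parse_block, b, num_cols) for b in blocks]
        rows_before = 0
        # Results are consumed in block order, so an error in block X is only
        # reported after the row counts of all earlier blocks are known.
        for fut in futures:
            n_rows, bad = fut.result()
            if bad is not None:
                return rows_before + bad + 1
            rows_before += n_rows
    return None
```

The cost is that every block must carry its row count and any pending error to a single ordered consumer, which is the extra complexity in question.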
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

