westonpace commented on pull request #10202:
URL: https://github.com/apache/arrow/pull/10202#issuecomment-872540450


   I'm not sure there is an obvious way to solve this problem in parallel.  The 
parser will start parsing block X before block X-1 has finished parsing.  The 
parser input (CSVBlock) doesn't know how many lines it has.  That is not 
discovered until parsing time.  So, for example, the parser for block 2 might 
realize there is an error on the third line of block 2.  But, without knowing 
how many lines are in block 1 (and block 1 may not have finished parsing) it is 
hard to say what the line number of that error is.
   
   You could do a serial pass prior to parsing that just counts the lines in 
each block, but I suspect that would add too much overhead.
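   
   A minimal sketch of such a serial pre-pass, in Python for brevity rather 
than Arrow's C++. It assumes each block is a `bytes` object that ends on a line 
boundary, and it counts raw newlines, so it would miscount newlines embedded in 
quoted CSV values. The name `first_line_numbers` is hypothetical and not part 
of Arrow.

```python
from itertools import accumulate

def first_line_numbers(blocks):
    """Return the 1-based line number at which each block starts.

    Assumption: every block ends on a line boundary and newlines never
    occur inside quoted values.
    """
    counts = [block.count(b"\n") for block in blocks]
    # Exclusive prefix sum: lines that precede each block.
    lines_before = list(accumulate(counts, initial=0))[:-1]
    return [1 + n for n in lines_before]

blocks = [b"a,b\n1,2\n", b"3,4\n5,x\n6,7\n"]
starts = first_line_numbers(blocks)
# An error on the third line of the second block maps to this global line.
global_line = starts[1] + 3 - 1
```

   The pass itself is cheap per byte, but it touches every block once before 
any parallel parsing can begin, which is the overhead concern raised above.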
   
   You could delay reporting an error on block X until blocks 1 to (X-1) have 
finished parsing (so you can know what the line number is).  That would 
probably be the solution I would take if I needed to do this.  However, I don't 
know off-hand how to do that delay in a low-complexity way.
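   
   One possible shape for that delay, as a hedged Python sketch rather than 
anything taken from Arrow. The class `DeferredErrorReporter` and its method 
`block_finished` are hypothetical names. Blocks are indexed from 0 here, each 
parser task reports its line count when it finishes, and an error is held back 
until the line counts of all earlier blocks are known.

```python
import threading

class DeferredErrorReporter:
    """Hold block-local errors until all earlier blocks have reported."""

    def __init__(self):
        self._lock = threading.Lock()
        self._line_counts = {}  # block index -> number of lines in block
        self._pending = {}      # block index -> (local_line, message)
        self._next_block = 0    # lowest block index not yet resolved
        self._lines_before = 0  # total lines in blocks before _next_block
        self.reported = []      # (global_line, message), in block order

    def block_finished(self, index, num_lines, error=None):
        # Assumption: a failing block passes the lines it parsed so far,
        # so only the first reported error is guaranteed to be accurate.
        with self._lock:
            self._line_counts[index] = num_lines
            if error is not None:
                self._pending[index] = error
            # Resolve every block whose predecessors are all known.
            while self._next_block in self._line_counts:
                idx = self._next_block
                if idx in self._pending:
                    local_line, message = self._pending.pop(idx)
                    self.reported.append(
                        (self._lines_before + local_line, message))
                self._lines_before += self._line_counts.pop(idx)
                self._next_block += 1

reporter = DeferredErrorReporter()
# The second block finishes first, with an error on its third line.
reporter.block_finished(1, 3, error=(3, "bad value"))
held_back = list(reporter.reported)
# Once the first block reports two lines, the error can be numbered.
reporter.block_finished(0, 2)
```

   The cost is a lock and a small amount of bookkeeping per block, and errors 
surface later than they are detected, since they wait on the slowest earlier 
block.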



