n3world commented on pull request #10202: URL: https://github.com/apache/arrow/pull/10202#issuecomment-872631327
> I'm not sure there is an obvious way to solve this problem in parallel. The parser will start parsing block X before block X-1 has finished parsing. The parser input (CSVBlock) doesn't know how many lines it has. That is not discovered until parsing time. So, for example, the parser for block 2 might realize there is an error on the third line of block 2. But, without knowing how many lines are in block 1 (and block 1 may not have finished parsing) it is hard to say what the line number of that error is.

Sorry, I don't think I was clear in my comment. I was able to report the byte offset of the row on which a parser error occurred, but I ran into a problem getting that offset during conversion. The goal was to give some location feedback even for parallel parsing, where the row number is difficult to determine, as you mentioned.

> You could do a serial pass prior to parsing that just figures out how many lines are in a block but I suspect that would be too much overhead.

I agree, this wouldn't be worth it.

> You could delay reporting an error on block X until blocks 1 to (X-1) have finished parsing (so you can know what the line number is). That would probably be the solution I would take if I needed to do this. However, I don't know off-hand how to do that delay in a low-complexity way.

That could work, but I'm not sure it is worth the complexity.
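
To make the delayed-reporting idea concrete, here is a minimal Python sketch. It is not Arrow's implementation: `parse_block`, `parse_blocks`, and the column-count check are hypothetical stand-ins for the real parser, and it assumes blocks are split on row boundaries with no quoted newlines. Each block is parsed in parallel and records errors with a line index local to the block; the global line number is resolved only once the row counts of all earlier blocks are known.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_block(block, num_cols):
    """Parse one block and return (row_count, local_errors).

    local_errors holds (line_index_within_block, message) pairs.
    The column-count check is a toy stand-in for a real parser error.
    """
    rows = block.splitlines()
    errors = [
        (i, f"expected {num_cols} columns, got {len(row.split(','))}")
        for i, row in enumerate(rows)
        if len(row.split(",")) != num_cols
    ]
    return len(rows), errors


def parse_blocks(blocks, num_cols):
    """Parse blocks in parallel, then report errors with global line numbers."""
    with ThreadPoolExecutor() as pool:
        # Executor.map returns results in input order, so block X's errors
        # are only looked at after blocks 0..X-1 have produced row counts.
        results = list(pool.map(lambda b: parse_block(b, num_cols), blocks))

    reported = []
    rows_before = 0  # total rows in all blocks preceding the current one
    for row_count, errors in results:
        for local_line, message in errors:
            # Convert the block-local index to a 1-based global line number.
            reported.append((rows_before + local_line + 1, message))
        rows_before += row_count
    return reported


if __name__ == "__main__":
    blocks = ["a,b\nc,d\n", "e,f\ng,h\ni\n"]
    for line_no, message in parse_blocks(blocks, num_cols=2):
        print(f"line {line_no}: {message}")
```

The cost of this approach is that an error in a late block is not surfaced until every earlier block has finished, which is the added latency and bookkeeping discussed above.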