rajagopr opened a new pull request, #11487:
URL: https://github.com/apache/pinot/pull/11487

   ## Problem
   The current `CSVRecordReader` uses the Apache commons-csv library under the 
hood to iterate over records. The default, and only, iterator provided by 
commons-csv throws an exception from `hasNext()` when it reaches an unparseable 
record, so iteration cannot continue past that record. The iterator cannot be 
overridden either, because the `CSVParser` class is declared final and the 
iterator is internal to it.
   
   See the open [issue](https://issues.apache.org/jira/browse/CSV-218) with the 
commons-csv library.
   
   ## Solution
   To work around this problem, the change uses an alternate iterator that 
works on the underlying buffered reader and modifies the `next()` and 
`hasNext()` methods of `CSVRecordReader`. With this change, `hasNext()` no 
longer throws an exception, so ingestion can make progress even when 
unparseable records are encountered. The drawbacks of this approach are: 
1) data loss and 2) reduced ingestion throughput.
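
   The idea of a line-based iterator can be illustrated with a minimal, 
self-contained sketch. This is not the code in this PR: the class name 
`LineSkippingCsvIterator`, the `expectedColumns` parameter, and the toy 
`tryParse` rule (balanced quotes plus a fixed column count) are all assumptions 
made for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: reads one physical line at a time and skips lines
// that fail to parse, instead of throwing from hasNext().
class LineSkippingCsvIterator implements Iterator<String[]> {
  private final BufferedReader reader;
  private final int expectedColumns; // assumption: fixed number of columns
  private String[] nextRecord;       // record buffered by hasNext()
  private int skippedCount;          // unparseable lines dropped so far

  LineSkippingCsvIterator(BufferedReader reader, int expectedColumns) {
    this.reader = reader;
    this.expectedColumns = expectedColumns;
  }

  @Override
  public boolean hasNext() {
    if (nextRecord != null) {
      return true;
    }
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = tryParse(line);
        if (fields != null) {
          nextRecord = fields;
          return true;
        }
        skippedCount++; // parse failure: drop the line and keep going
      }
    } catch (IOException e) {
      // Only genuine I/O failures propagate; parse failures never do.
      throw new UncheckedIOException(e);
    }
    return false;
  }

  @Override
  public String[] next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    String[] record = nextRecord;
    nextRecord = null;
    return record;
  }

  int getSkippedCount() {
    return skippedCount;
  }

  // Toy parser: splits on commas outside double quotes. Returns null when
  // quotes are unbalanced or the column count does not match.
  private String[] tryParse(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < line.length(); i++) {
      char c = line.charAt(i);
      if (c == '"') {
        inQuotes = !inQuotes;
      } else if (c == ',' && !inQuotes) {
        fields.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    fields.add(current.toString());
    if (inQuotes || fields.size() != expectedColumns) {
      return null;
    }
    return fields.toArray(new String[0]);
  }
}
```

   In this sketch, a line with an unterminated quote is dropped and counted 
rather than aborting the iteration. It also shows where both drawbacks come 
from: a quoted field that legitimately spans several physical lines would be 
discarded (data loss), and every line is read and checked individually (extra 
work per record).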
   
   However, there are situations where making progress is more important and 
this trade-off is acceptable. In such scenarios, the flag 
`skipUnParseableLines` can be set to use the line-based iterator.
   
   ## Alternate Solutions
   The following alternatives were considered:
   1) Switch to another library such as `OpenCSV`, which allows plugging in a 
custom iterator. [Not pursued due to: 1) maintenance overhead, 2) regression 
risk, and 3) an open vulnerability in the library]
   2) Write a new parser. [Not pursued as it would require a significant time 
investment. This will be revisited later.]
   
   ## Testing
   The change is covered by unit tests that guard against regressions and 
exercise the new code paths. Additionally, performance was tested on a 200 MB 
file: the current parser took 5 seconds on average (over 20 runs) and the new 
parser took 7 seconds on average (over 20 runs), about 40% longer.


