Thanks all. If the PR is available tomorrow, I can review and merge it as well, but I will be on vacation for a week after that. No pressure :)
Regards,
Matt

> On Sep 24, 2017, at 8:57 PM, Joe Witt <[email protected]> wrote:
>
> Thanks Arun and Peter. Getting that resolved will be nice. The
> performance difference of the record reader/writer approach in all
> this is pretty fantastic, so the more we can do to iron out these sorts
> of edges the better. Thanks!
>
>> On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks) <[email protected]>
>> wrote:
>> Arun,
>>
>> I'm also using Ctrl+A as a delimiter and had the same problem. I haven't
>> had time to write up a PR, but it looked like a pretty easy fix to me too.
>>
>> I can't merge the change if you submit it, but I'd be happy to review it.
>>
>> --Peter
>>
>> -----Original Message-----
>> From: Arun Manivannan [mailto:[email protected]]
>> Sent: Sunday, September 24, 2017 11:17 PM
>> To: [email protected]
>> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
>>
>> Hi,
>>
>> The ConvertCSVToAvro processor has been having performance issues while
>> processing files larger than a GB, and it was suggested that I use
>> ConvertRecord, which leverages the RecordReader and Writer. I ran some
>> tests and they perform well.
>>
>> Strangely, the CSVReader doesn't accept a Unicode character as the value
>> delimiter. The Control-A (\u0001) character is the delimiter of my CSV.
>>
>> I did some analysis and I see that a minor change needs to be made in
>> CSVUtils to unescape the delimiter, as ConvertCSVToAvro does, and also to
>> modify the SingleCharacterValidator.
>>
>> Please let me know if you believe this isn't an issue and there's a
>> workaround for this. Otherwise, I am more than happy to raise an issue and
>> submit a PR for review.
>>
>> Best Regards,
>> Arun
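
For anyone following the thread, the change Arun describes (unescape the configured delimiter first, then check that the result is a single character) can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual CSVUtils or SingleCharacterValidator code: the class DelimiterSketch and the methods unescapeDelimiter and unescape are hypothetical names, and the hand-rolled unescaping handles just a few escape forms. A real patch would more likely reuse an existing library routine for the unescaping step.

```java
// Hypothetical sketch; NOT the actual NiFi CSVUtils or validator code.
class DelimiterSketch {

    // Resolves a configured delimiter (for example the six-character text
    // "backslash, u, 0, 0, 0, 1") to the single character it denotes.
    // The single-character check runs AFTER unescaping, which is the
    // ordering the proposed fix relies on.
    static char unescapeDelimiter(String configured) {
        String unescaped = unescape(configured);
        if (unescaped.length() != 1) {
            throw new IllegalArgumentException(
                "Delimiter must resolve to exactly one character: " + configured);
        }
        return unescaped.charAt(0);
    }

    // Minimal unescaper: handles a backslash followed by 'u' and four hex
    // digits, a backslash followed by 't', and a doubled backslash.
    // Anything else is copied through unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '\\' || i + 1 >= s.length()) {
                out.append(c);
                i++;
                continue;
            }
            char next = s.charAt(i + 1);
            if (next == 'u' && i + 6 <= s.length()) {
                // Parse the four hex digits that follow the 'u'.
                out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                i += 6;
            } else if (next == 't') {
                out.append('\t');
                i += 2;
            } else if (next == '\\') {
                out.append('\\');
                i += 2;
            } else {
                // Unknown escape: keep the backslash literally.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Print the code point of the resolved Ctrl+A delimiter.
        System.out.println((int) unescapeDelimiter("\\u0001"));
    }
}
```

The point of the ordering is that a validator which checks the raw property value sees six characters and rejects it, while one that checks the unescaped value sees a single control character and accepts it.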
