Thanks all. If the PR is available tomorrow, I can review and merge it as well,
but I will be on vacation for a week after that. No pressure :)

Regards,
Matt

> On Sep 24, 2017, at 8:57 PM, Joe Witt <[email protected]> wrote:
> 
> Thanks Arun and Peter.  Getting that resolved will be nice.  The
> performance difference of the record reader/writer approach in all
> this is pretty fantastic so the more we can do to iron out these sorts
> of edges the better.  Thanks!
> 
>> On Sun, Sep 24, 2017 at 8:56 PM, Peter Wicks (pwicks) <[email protected]> 
>> wrote:
>> Arun,
>> 
>> I'm also using Ctrl+A as a delimiter and had the same problem.  I haven't 
>> had time to write up a PR but it looked like a pretty easy fix to me too.
>> 
>> I can't merge the change if you submit it, but I'd be happy to review it.
>> 
>> --Peter
>> 
>> -----Original Message-----
>> From: Arun Manivannan [mailto:[email protected]]
>> Sent: Sunday, September 24, 2017 11:17 PM
>> To: [email protected]
>> Subject: [EXT] ConvertCSVToAvro vs CSVReader - Value Delimiter
>> 
>> Hi,
>> 
>> The ConvertCSVToAvro processor has been having performance issues while 
>> processing files larger than a GB, and it was suggested that I use 
>> ConvertRecord, which leverages the RecordReader and RecordWriter. I ran some 
>> tests and they do perform well.
>> 
>> Strangely, the CSVReader doesn't accept a Unicode character as the value 
>> delimiter. My CSV uses the Control-A (\u0001) character as its delimiter.
>> 
>> I did some analysis, and it looks like a minor change is needed in CSVUtils 
>> to unescape the delimiter, as ConvertCSVToAvro does, along with a change to 
>> the SingleCharacterValidator.
>> 
>> Please let me know if you believe this isn't an issue and there's a 
>> workaround for it. Otherwise, I am more than happy to raise an issue and 
>> submit a PR for review.
>> 
>> Best Regards,
>> Arun
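
The unescaping step Arun describes could be sketched roughly as follows. This is only an illustration of the idea, not NiFi's actual code: the class DelimiterUnescape and its methods are hypothetical names, and the real CSVUtils and SingleCharacterValidator may look quite different. The sketch assumes the configured property value arrives as literal text (a backslash, a "u", and four hex digits) and must be turned into one character before it is validated and used.

```java
// Hypothetical helper for illustration only; not NiFi's CSVUtils.
final class DelimiterUnescape {

    private DelimiterUnescape() {}

    /** Converts backslash-u, backslash-t and double-backslash escapes into the characters they denote. */
    static String unescape(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '\\' || i + 1 >= input.length()) {
                out.append(c);          // ordinary character, or a trailing backslash
                i++;
                continue;
            }
            char next = input.charAt(i + 1);
            if (next == 'u' && i + 6 <= input.length()) {
                int code = 0;           // read exactly four hex digits
                for (int k = i + 2; k < i + 6; k++) {
                    int digit = Character.digit(input.charAt(k), 16);
                    if (digit < 0) {
                        throw new IllegalArgumentException("Bad unicode escape in: " + input);
                    }
                    code = code * 16 + digit;
                }
                out.append((char) code);
                i += 6;
            } else if (next == 't') {
                out.append('\t');
                i += 2;
            } else if (next == '\\') {
                out.append('\\');
                i += 2;
            } else {
                out.append(c);          // unknown escape: keep the backslash as-is
                i++;
            }
        }
        return out.toString();
    }

    /** A delimiter is acceptable if it is exactly one character after unescaping. */
    static boolean isValidDelimiter(String configured) {
        try {
            return unescape(configured).length() == 1;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Returns the single delimiter character, or throws if the value is not usable. */
    static char toDelimiter(String configured) {
        String value = unescape(configured);
        if (value.length() != 1) {
            throw new IllegalArgumentException("Delimiter must be a single character: " + configured);
        }
        return value.charAt(0);
    }
}
```

The point of validating after unescaping is that a six-character property value such as the escaped form of Control-A would otherwise be rejected by a check that counts raw characters.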
