Re: UTF-8 Support for TextParser

Anirudh Fri, 09 Mar 2018 12:43:35 -0800

Hi,

After looking more closely at the customer requirement, we found that the
customer uses only ASCII data with MXNet. They just want files that start
with a UTF-8 BOM, and files that use different control characters for
newlines, to parse correctly. dmlc-core already handles the different
newline control characters.
Since a UTF-8 BOM in files is a common case for other MXNet users too (for
example, saving from Excel as UTF-8 CSV), I will add support for handling
the UTF-8 BOM in dmlc-core.
I won't be working on UTF8CSVParser unless a customer requirement for it
comes up later.
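
For reference, a rough sketch of what the BOM handling could look like.
This is only an illustration: SkipUTF8BOM is a hypothetical helper name,
not an existing dmlc-core function, and where it would be called inside
the parser is still open. The only fixed fact is that the UTF-8 BOM is the
three-byte sequence EF BB BF at the very start of the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper (not part of dmlc-core): returns a pointer just past
// a leading UTF-8 BOM (bytes EF BB BF) if the buffer starts with one,
// otherwise returns `begin` unchanged.
inline const char* SkipUTF8BOM(const char* begin, const char* end) {
  assert(begin <= end);
  static const unsigned char kBOM[3] = {0xEF, 0xBB, 0xBF};
  // Only compare when at least three bytes are available.
  if (end - begin >= 3 && std::memcmp(begin, kBOM, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

The check would only need to run once, on the first chunk of a file, so
it should not affect the per-line parsing cost of the ASCII parser.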


Anirudh



On Wed, Feb 28, 2018 at 11:50 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Tianqi,
>
> What do you think about adding a separate parser for CSV with UTF8 support
> in dmlc-core? We can then just add a flag in MXNet for UTF8 and use the
> UTF8 or the ASCII parser based on this flag. (This idea was suggested by
> Mu).
>
> I think some small changes will be required to the base class
> "TextParserBase", as the method "BackFindEndLine" will need more logic
> to check for other code points for line breaks, which can be refactored.
> This approach will likely retain the performance of the existing ASCII CSV
> parser, while letting MXNet users choose between the usability of the
> UTF-8 CSV parser and the performance of the ASCII CSV parser.
>
> Thanks,
> Anirudh
>
>
> On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <anirudh2...@gmail.com> wrote:
>
>> Hi Marco,
>>
>> I understand that there needs to be a different discussion on strong
>> dependency of mxnet and dmlc-core and how to fix it.
>>
>> Having said that, I think the goals of dmlc-core and mxnet are somewhat
>> aligned. Posting in the MXNet dev list for this case
>> is a good way to gather feedback from both the communities since I
>> consider the MXNet community to be mostly a superset of the dmlc-core
>> community.
>>
>> Anirudh
>>
>> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com>
>> wrote:
>>
>>> Hi Tianqi,
>>>
>>> UTF-8 support would make other formats like CSV more usable.
>>> Otherwise, users have to normalize their data in some way before
>>> using mxnet.
>>> I understand that there is a tradeoff here because of the efficiency
>>> gains from the parser, but expecting users to normalize their UTF-8
>>> files may turn them away.
>>>
>>> Anirudh
>>>
>>> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen" <
>>> workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
>>>
>>>     Since the LibSVM format is only going to involve numbers and possibly
>>>     ASCII characters, is there any reason to add UTF-8 support? Note that
>>>     generalization always comes at a cost in efficiency, and some effort
>>>     has been spent on making the parser fast.
>>>
>>>     Tianqi
>>>
>>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2...@gmail.com>
>>> wrote:
>>>
>>>     > Hi all,
>>>     >
>>>     > Currently there is no UTF-8 support for the LibSVM, LibFM or CSV
>>>     > text parsers. I am working on adding UTF-8 support for the text
>>>     > parsers. Since C++ doesn't have great built-in support for UTF-8,
>>>     > I am looking at third-party libraries which provide Unicode
>>>     > support. I am currently considering ICU. Any comments, suggestions,
>>>     > past experience, or gotchas about third-party Unicode libraries or
>>>     > adding Unicode support in general are highly appreciated.
>>>     >
>>>     > I have created an issue about the same:
>>>     > https://github.com/dmlc/dmlc-core/issues/372
>>>     > Please feel free to reply to this email or comment on the GitHub
>>>     > issue if you have any input.
>>>     >
>>>     > Anirudh
>>>     >
>>>
>>>
>>>
>>
>
