Re: UTF-8 Support for TextParser

Anirudh Mon, 26 Feb 2018 17:18:47 -0800

Hi Marco,

I understand that there needs to be a separate discussion about the strong
dependency between mxnet and dmlc-core and how to fix it.


Having said that, I think the goals of dmlc-core and mxnet are somewhat
aligned. Posting to the MXNet dev list in this case is a good way to gather
feedback from both communities, since I consider the MXNet community to be
mostly a superset of the dmlc-core community.

Anirudh

On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <[email protected]>
wrote:

> Hi Tianqi,
>
> UTF-8 support would make other formats like CSV more usable.
> Otherwise, users have to normalize their data in some way before
> using mxnet.
> I understand that there is a tradeoff here because of the efficiency gains
> from the parser, but the expectation of having to normalize their UTF-8
> files may turn users away.
>
> Anirudh
>
> On 2/26/18, 3:54 PM, "[email protected] on behalf of Tianqi Chen" <
> [email protected] on behalf of [email protected]> wrote:
>
>     Since the LibSVM format is only going to involve numbers and possibly
>     ASCII characters, is there any reason to add UTF-8 support? Note that
>     generalization always comes at a cost in efficiency, and some effort
>     has been spent on making the parser fast.
>
>     Tianqi
>
>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <[email protected]> wrote:
>
>     > Hi all,
>     >
>     > Currently there is no UTF-8 support for the LibSVM, LibFM, or CSV
>     > text parsers. I am working on adding UTF-8 support for the text
>     > parsers. Since C++ doesn't have great built-in support for UTF-8, I
>     > am looking at third-party libraries which provide Unicode support.
>     > I am currently considering ICU. Any comments, suggestions, past
>     > experience, or gotchas about third-party Unicode libraries, or about
>     > adding Unicode support in general, are highly appreciated.
>     >
>     > I have created an issue about this:
>     > https://github.com/dmlc/dmlc-core/issues/372
>     > Please feel free to reply to this email or comment on the GitHub
>     > issue if you have any input.
>     >
>     > Anirudh
>     >
>
>
>
