Re: UTF-8 Support for TextParser

Chris Olivier Fri, 09 Mar 2018 13:12:55 -0800

For this, are you going to run the entire text through a converter, or just
prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
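
As a point of reference for the BOM handling discussed in this thread:
detecting and skipping a leading UTF-8 BOM does not require a conversion pass
over the text. Below is a minimal C++ sketch, assuming the parser sees the
start of the file as a contiguous byte range. `SkipUtf8Bom` is a hypothetical
helper name used for illustration, not an existing dmlc-core function.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of dmlc-core): returns a pointer just past a
// leading UTF-8 BOM (0xEF 0xBB 0xBF) if one is present, otherwise `begin`.
inline const char* SkipUtf8Bom(const char* begin, const char* end) {
  assert(begin <= end);  // precondition: a valid [begin, end) byte range
  static const unsigned char kBom[3] = {0xEF, 0xBB, 0xBF};
  if (end - begin < 3) return begin;  // too short to hold a BOM
  for (std::size_t i = 0; i < 3; ++i) {
    // Compare as unsigned char: plain char may be signed on some platforms.
    if (static_cast<unsigned char>(begin[i]) != kBom[i]) return begin;
  }
  return begin + 3;
}
```

A parser would call this once on the first chunk of a file and only there,
since the same three bytes appearing later in the data are not a BOM.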


On Fri, Mar 9, 2018 at 12:43 PM, Anirudh <[email protected]> wrote:

> Hi,
>
> After looking more closely at the customer requirement, we found that the
> customer uses only ASCII data with MXNet; they just want files that contain
> a UTF-8 BOM at the start, and files that use different control characters
> for newlines, to be handled correctly. dmlc-core already supports the
> different control characters for newlines.
> Since a UTF-8 BOM in files is a common case for other MXNet users too
> (for example, saving an Excel sheet as UTF-8 CSV), I will add support for
> handling the UTF-8 BOM in dmlc-core.
> I won't be working on UTF8CSVParser unless there is a customer requirement
> that comes up later on.
>
> Anirudh
>
>
>
> On Wed, Feb 28, 2018 at 11:50 PM, Anirudh <[email protected]> wrote:
>
> > Hi Tianqi,
> >
> > What do you think about adding a separate parser for CSV with UTF8 support
> > in dmlc-core? We can then just add a flag in MXNet for UTF8 and use the
> > UTF8 or the ASCII parser based on this flag. (This idea was suggested by
> > Mu).
> >
> > I think some small changes will be required to the base class
> > "TextParserBase", since the method "BackFindEndLine" will need more logic
> > to check for other code points that act as line breaks; this can be
> > refactored. This approach will likely retain the performance of the
> > existing ASCII CSV parser, while letting MXNet users choose between
> > usability with the UTF-8 CSV parser and performance with the ASCII CSV
> > parser.
> >
> > Thanks,
> > Anirudh
> >
> >
> > On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <[email protected]> wrote:
> >
> >> Hi Marco,
> >>
> >> I understand that there needs to be a separate discussion about the strong
> >> dependency between MXNet and dmlc-core and how to fix it.
> >>
> >> Having said that, I think the goals of dmlc-core and MXNet are somewhat
> >> aligned. Posting to the MXNet dev list for this case is a good way to
> >> gather feedback from both communities, since I consider the MXNet
> >> community to be mostly a superset of the dmlc-core community.
> >>
> >> Anirudh
> >>
> >> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <[email protected]>
> >> wrote:
> >>
> >>> Hi Tianqi,
> >>>
> >>> UTF-8 support would make other formats like CSV more usable.
> >>> Otherwise, users have to normalize their data in some way before
> >>> using MXNet.
> >>> I understand that there is a tradeoff here because of the efficiency
> >>> gains from the parser, but the expectation of having to normalize their
> >>> UTF-8 files may turn users away.
> >>>
> >>> Anirudh
> >>>
> >>> On 2/26/18, 3:54 PM, "[email protected] on behalf of Tianqi Chen" <
> >>> [email protected] on behalf of [email protected]> wrote:
> >>>
> >>>     Since LibSVM format is only going to involve numbers and possibly
> >>>     ASCII characters, is there any reason to add UTF-8 support? Note that
> >>>     generalization always comes at a cost in efficiency, and some effort
> >>>     has been spent on making the parser fast.
> >>>
> >>>     Tianqi
> >>>
> >>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <[email protected]> wrote:
> >>>
> >>>     > Hi all,
> >>>     >
> >>>     > Currently there is no UTF-8 support in the LibSVM, LibFM, or CSV
> >>>     > text parsers. I am working on adding UTF-8 support for the text
> >>>     > parsers. Since C++ doesn't have great built-in support for UTF-8,
> >>>     > I am looking at third-party libraries that provide Unicode support.
> >>>     > I am currently considering ICU. Any comments, suggestions, past
> >>>     > experience, or gotchas about third-party Unicode libraries, or about
> >>>     > adding Unicode support in general, would be highly appreciated.
> >>>     >
> >>>     > I have created an issue for this:
> >>>     > https://github.com/dmlc/dmlc-core/issues/372
> >>>     > Please feel free to reply to this email or comment on the GitHub
> >>>     > issue if you have any input.
> >>>     >
> >>>     > Anirudh
> >>>     >
> >>>
> >>>
> >>>
> >>
> >
>
