For this, are you going to run the entire text through a converter, or just prepend the UTF-8 header to the file (0xEF,0xBB,0xBF)?
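
For reference, the BOM handling discussed in this thread could be as small as checking the first three bytes of the input on read. The following is only a sketch under stated assumptions: `Utf8BomLength` and `StripUtf8Bom` are hypothetical helper names and are not part of dmlc-core; the only fact taken from the thread is the byte sequence 0xEF,0xBB,0xBF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a dmlc-core API): returns how many leading bytes
// should be skipped. This is 3 when the buffer starts with the UTF-8 BOM
// (0xEF, 0xBB, 0xBF) and 0 otherwise.
inline std::size_t Utf8BomLength(const char* begin, const char* end) {
  static const unsigned char kBom[3] = {0xEF, 0xBB, 0xBF};
  if (end - begin < 3) return 0;  // too short to contain a full BOM
  for (int i = 0; i < 3; ++i) {
    // Compare as unsigned char so the check is independent of char signedness.
    if (static_cast<unsigned char>(begin[i]) != kBom[i]) return 0;
  }
  return 3;
}

// Hypothetical convenience wrapper: returns a copy of the text with a
// leading BOM removed; text without a BOM is returned unchanged.
inline std::string StripUtf8Bom(const std::string& text) {
  const char* begin = text.data();
  return text.substr(Utf8BomLength(begin, begin + text.size()));
}
```

A parser would only need to apply such a check once, to the first chunk of a file, so the per-line ASCII parsing path would stay untouched.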

On Fri, Mar 9, 2018 at 12:43 PM, Anirudh <[email protected]> wrote:

> Hi,
>
> Upon deeper understanding of customer requirement we found out that the
> customer uses only ASCII data with MXNet, just that they want the files
> containing UTF-8 BOM at the start and files with different control
> characters for newline to play well. dmlc-core already supports control
> characters for newline.
> Since, the UTF-8 BOM in files is a common use case for other users of MXNet
> too (for example, saving excel as UTF-8 csv) I will add support for
> handling the UTF-8 BOM in dmlc-core.
> I won't be working on UTF8CSVParser unless there is a customer requirement
> that comes up later on.
>
> Anirudh
>
> On Wed, Feb 28, 2018 at 11:50 PM, Anirudh <[email protected]> wrote:
>
> > Hi Tianqi,
> >
> > What do you think about adding a separate parser for CSV with UTF8
> > support in dmlc-core? We can then just add a flag in MXNet for UTF8 and
> > use the UTF8 or the ASCII parser based on this flag. (This idea was
> > suggested by Mu).
> >
> > I think there will be some small changes required to the base class
> > "TextParserBase" as the method "BackFindEndLine" will have more logic in
> > it to check for other code-points for line-breaks, which can be
> > refactored.
> > This approach will likely retain the performance of the existing ASCII
> > CSV Parser, while allowing MXNet users to make the decision w.r.t
> > usability with UTF-8 CSV parser / performance with ASCII CSV parser.
> >
> > Thanks,
> > Anirudh
> >
> > On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <[email protected]> wrote:
> >
> >> Hi Marco,
> >>
> >> I understand that there needs to be a different discussion on strong
> >> dependency of mxnet and dmlc-core and how to fix it.
> >>
> >> Having said that, I think the goals of dmlc-core and mxnet are somewhat
> >> aligned. Posting in the MXNet dev list for this case
> >> is a good way to gather feedback from both the communities since I
> >> consider the MXNet community to be mostly a superset of the dmlc-core
> >> community.
> >>
> >> Anirudh
> >>
> >> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh
> >> <[email protected]> wrote:
> >>
> >>> Hi Tianqi,
> >>>
> >>> The UTF-8 support would enable other formats like CSV more usable.
> >>> Otherwise, they have to handle normalizing their data in some way
> >>> before using mxnet.
> >>> I understand that there is a tradeoff here because of the efficiency
> >>> gains from the parser but the expectation of having to normalize their
> >>> UTF-8 files may turn users away.
> >>>
> >>> Anirudh
> >>>
> >>> On 2/26/18, 3:54 PM, "[email protected] on behalf of Tianqi Chen"
> >>> <[email protected] on behalf of [email protected]> wrote:
> >>>
> >>>     Since LibSVM format is only going to involve numbers and possibly
> >>>     ascii characters, is there any reason adding UTF-8 support? Note
> >>>     that generalization always comes with cost of efficiency and there
> >>>     is some effort spent on making parser fast
> >>>
> >>>     Tianqi
> >>>
> >>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <[email protected]>
> >>>     wrote:
> >>>
> >>>     > Hi all,
> >>>     >
> >>>     > Currently there is no UTF-8 Support for LibSVM, LibFM or CSV Text
> >>>     > parsers.
> >>>     > I am currently working on adding UTF-8 support for Text parsers.
> >>>     > Since C++ doesn't have a great built-in support for UTF-8, I am
> >>>     > looking at third-party libraries which provide Unicode support.
> >>>     > I am considering ICU currently. Any comments, suggestions, past
> >>>     > experience, gotchas about unicode third party libraries or adding
> >>>     > unicode support in general is highly appreciated.
> >>>     >
> >>>     > I have created an issue about the same:
> >>>     > https://github.com/dmlc/dmlc-core/issues/372
> >>>     > Please feel free to reply to this email or comment on the github
> >>>     > issue if you have any inputs.
> >>>     >
> >>>     > Anirudh
