Hi,

After looking more closely at the customer requirement, we found that the customer uses only ASCII data with MXNet. What they need is for files that start with a UTF-8 BOM, and files that use different control characters for newlines, to be handled correctly. dmlc-core already supports the different newline control characters. Since a UTF-8 BOM at the start of a file is a common case for other MXNet users too (for example, when saving an Excel sheet as a UTF-8 CSV), I will add support for handling the UTF-8 BOM in dmlc-core. I won't be working on UTF8CSVParser unless a customer requirement for it comes up later.
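To illustrate the kind of BOM handling meant here, a minimal sketch follows. It is not the actual dmlc-core change: `SkipUTF8BOM` is a hypothetical helper name, and the sketch assumes the parser sees the start of the file as a contiguous byte range. The UTF-8 BOM is the three-byte sequence EF BB BF, so the check only needs to look at the first three bytes and move the start pointer past them.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not an existing dmlc-core function).
// Returns a pointer just past a leading UTF-8 BOM (bytes EF BB BF),
// or `begin` unchanged when the range does not start with one.
inline const char* SkipUTF8BOM(const char* begin, const char* end) {
  assert(begin <= end);  // precondition: a valid [begin, end) byte range
  static const unsigned char kBOM[3] = {0xEF, 0xBB, 0xBF};
  if (end - begin >= 3 && std::memcmp(begin, kBOM, 3) == 0) {
    return begin + 3;
  }
  return begin;
}
```

The check should run only on the first chunk of a file, since the same three bytes appearing later in the data are not a BOM. Because it is a single three-byte comparison per file, it should not affect the speed of the existing ASCII parsing path.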
Anirudh

On Wed, Feb 28, 2018 at 11:50 PM, Anirudh <anirudh2...@gmail.com> wrote:

> Hi Tianqi,
>
> What do you think about adding a separate parser for CSV with UTF8 support
> in dmlc-core? We can then just add a flag in MXNet for UTF8 and use the
> UTF8 or the ASCII parser based on this flag. (This idea was suggested by
> Mu).
>
> I think there will be some small changes required to the base class
> "TextParserBase" as the method "BackFindEndLine" will have more logic in it
> to check for other code-points for line-breaks, which can be refactored.
> This approach will likely retain the performance of the existing ASCII CSV
> Parser, while allowing MXNet users to make the decision w.r.t usability
> with UTF-8 CSV parser / performance with ASCII CSV parser.
>
> Thanks,
> Anirudh
>
> On Mon, Feb 26, 2018 at 5:18 PM, Anirudh <anirudh2...@gmail.com> wrote:
>
>> Hi Marco,
>>
>> I understand that there needs to be a different discussion on strong
>> dependency of mxnet and dmlc-core and how to fix it.
>>
>> Having said that, I think the goals of dmlc-core and mxnet are somewhat
>> aligned. Posting in the MXNet dev list for this case is a good way to
>> gather feedback from both the communities since I consider the MXNet
>> community to be mostly a superset of the dmlc-core community.
>>
>> Anirudh
>>
>> On Mon, Feb 26, 2018 at 5:00 PM, Subramanian, Anirudh <ani...@amazon.com>
>> wrote:
>>
>>> Hi Tianqi,
>>>
>>> The UTF-8 support would enable other formats like CSV more usable.
>>> Otherwise, they have to handle normalizing their data in some way before
>>> using mxnet.
>>> I understand that there is a tradeoff here because of the efficiency
>>> gains from the parser but the expectation of having to normalize their
>>> UTF-8 files may turn users away.
>>>
>>> Anirudh
>>>
>>> On 2/26/18, 3:54 PM, "workc...@gmail.com on behalf of Tianqi Chen"
>>> <workc...@gmail.com on behalf of tqc...@cs.washington.edu> wrote:
>>>
>>>     Since LibSVM format is only going to involve numbers and possibly
>>>     ascii characters, is there any reason adding UTF-8 support? Note that
>>>     generalization always comes with cost of efficiency and there is some
>>>     effort spent on making parser fast
>>>
>>>     Tianqi
>>>
>>>     On Mon, Feb 26, 2018 at 3:38 PM, Anirudh <anirudh2...@gmail.com>
>>>     wrote:
>>>
>>>     > Hi all,
>>>     >
>>>     > Currently there is no UTF-8 Support for LibSVM, LibFM or CSV Text
>>>     > parsers. I am currently working on adding UTF-8 support for Text
>>>     > parsers. Since C++ doesn't have a great built-in support for UTF-8,
>>>     > I am looking at third-party libraries which provide Unicode support.
>>>     > I am considering ICU currently. Any comments, suggestions, past
>>>     > experience, gotchas about unicode third party libraries or adding
>>>     > unicode support in general is highly appreciated.
>>>     >
>>>     > I have created an issue about the same:
>>>     > https://github.com/dmlc/dmlc-core/issues/372
>>>     > Please feel free to reply to this email or comment on the github
>>>     > issue if you have any inputs.
>>>     >
>>>     > Anirudh