On Thu, Sep 4, 2014 at 3:10 AM, Martin Andreas Andersen < [email protected]> wrote:
> Updated beancount, everything works as expected. However, if I add the
> directive
>
> 1984-01-01 open Aktiver:NørresundbyBank:Nemkonto
>
> i get this error:
>
>> /home/martin/Dropbox/Documents/Finances/Budget/test.beancount:21:
>> syntax error, unexpected ERROR, expecting ACCOUNT
>>
>> /home/martin/Dropbox/Documents/Finances/Budget/test.beancount:21:
>> Lexer error; erroneous token: 'Aktiver:NørresundbyBank:Nemkonto'
>>
> which seems to indicate the parser can't handle the 'ø'. For now, I can
> replace the non-unicode characters.
>

That's a good workaround.

> How hard would it be to add unicode support? And where would I look in the
> source, if I wanted to hack at it?
>

I knew this day would come, but I did not expect it would be so soon.

So here's the context: I've been using flex and bison3, and the main reason
for that is that I'm really sensitive about dependencies. I feel that using
old tools that are available literally everywhere, together with the C
language, makes it much easier to deal with the gigantic cosmic mess that is
installation and portability. So I've stuck with these old crotchety tools
for a reason. They're not even that easy to use, so it's actually a bit of a
liability, but I really like the ease of installation they provide, and you
benefit when all it takes is a 2-second "make build" that just works, so it's
worth it IMO.

Now, I've been unhappy with flex's handling of word boundaries and its error
reporting, so I have recently considered writing my own lexer by hand. It's
not very hard; I just haven't had time. This could be done with unicode
support in C. Also, flex apparently produces 8-bit clean output and
technically it should be possible to make it grok UTF-8, but I haven't tried
it. As for bison, I think that if the lexer spits out wchar tokens or
whatever, I don't really see why it shouldn't be able to handle unicode.
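
To make the failure above concrete, here is a minimal Python sketch of an ASCII-only account check next to a Unicode-aware one. This is not beancount's lexer: the function names `is_account` and `is_ascii_account` and the account-name rules (colon-separated components, each starting with an uppercase letter) are assumptions made up for illustration.

```python
def is_account(token: str) -> bool:
    """Hypothetical Unicode-aware check for an account name.

    Assumed rules (not taken from beancount): at least two components
    separated by ':', each starting with an uppercase letter and
    continuing with letters, digits, or '-'.
    """
    parts = token.split(":")
    if len(parts) < 2:
        return False
    for part in parts:
        if not part or not part[0].isupper():
            return False
        if not all(ch.isalnum() or ch == "-" for ch in part[1:]):
            return False
    return True


def is_ascii_account(token: str) -> bool:
    """Same assumed rules, but rejecting any non-ASCII character,
    roughly what a byte-oriented ASCII-only lexer rule would do."""
    return token.isascii() and is_account(token)


# An 8-bit clean lexer would hand over raw UTF-8 bytes; decoding them
# yields the text that the Unicode-aware check operates on.
raw = "Aktiver:NørresundbyBank:Nemkonto".encode("utf-8")
token = raw.decode("utf-8")

print(is_ascii_account(token))  # False: 'ø' is outside ASCII
print(is_account(token))        # True: 'ø' counts as a letter
```

The romanized spelling `Aktiver:NorresundbyBank:Nemkonto` passes both checks, which is why replacing the character works as a stopgap.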
Python can convert encoded C strings through its API, so once the Python
builder callbacks get invoked, Bob's your uncle and everything else should
work. But it isn't a trivial project nor a small "few lines" patch.

The code is under beancount/src/python/beancount/parser/... look at files:

  lexer.l
  lexer.py
  grammar.y
  parser.c
  parser.py

I'll consider anything if you want to submit a patch for such a big change;
overall, my criteria for including a large change in the parsing tech stack
are (1) C code only, no C++ or exotic languages, (2) absolute minimal
dependencies on third-party packages, all the better if the code it depends
on is itself written in just C, and (3) it runs on Linux & Mac OS X, or at
least generates code that does.

Also, it should be easy to write a second parser module in _parallel_ with
the existing one (e.g. what gets compiled as beancount/parser/_parser.so),
reusing all the other bits of Python, the builder, etc. We could very easily
have two or more parser implementations in the interest of transition and
experimentation.

Finally, my dev priorities at the moment: finish the example & documentation,
then implement all reports to text, a filtering expression syntax, and the
new inventory booking proposal (to support average cost booking and cost
basis adjustments), and then all the other stuff comes after. So Unicode is
farther down the line, unfortunately. I hope you can be happy with
romanization of those characters for a bit.

So in summary: not a trivial project. If you can wait, I'll handle it myself
eventually (think: maybe within a year). If you can't wait, you're welcome to
have a go at it and send some code.

Thank you,
