On Thu, Sep 4, 2014 at 3:10 AM, Martin Andreas Andersen <
[email protected]> wrote:

> Updated beancount, everything works as expected. However, if I add the
> directive
>
> 1984-01-01 open Aktiver:NørresundbyBank:Nemkonto
>
> i get this error:
>
>
>> /home/martin/Dropbox/Documents/Finances/Budget/test.beancount:21:
>>  syntax error, unexpected ERROR, expecting ACCOUNT
>>
>> /home/martin/Dropbox/Documents/Finances/Budget/test.beancount:21:
>>  Lexer error; erroneous token: 'Aktiver:NørresundbyBank:Nemkonto'
>>
>  which seems to indicate the parser can't handle the 'ø'. For now, I can
> replace the non-unicode characters.
>

That's a good workaround.
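
In case it helps, here is a minimal sketch of that romanization in Python.
The replacement table and the romanize() name are made up for illustration
and are not part of beancount. The explicit table is needed because letters
like 'ø' have no Unicode decomposition, so normalization alone would simply
drop them.

```python
import unicodedata

# Hypothetical replacement table; letters such as 'ø' have no Unicode
# decomposition, so they need an explicit entry.
REPLACEMENTS = {
    "ø": "oe", "Ø": "Oe",
    "æ": "ae", "Æ": "Ae",
    "å": "aa", "Å": "Aa",
    "ß": "ss",
}


def romanize(text):
    """Return an ASCII-only approximation of text."""
    replaced = "".join(REPLACEMENTS.get(ch, ch) for ch in text)
    # Decompose remaining accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", replaced)
    # Drop whatever is still not ASCII (combining marks, unknown symbols).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(romanize("1984-01-01 open Aktiver:NørresundbyBank:Nemkonto"))
# prints: 1984-01-01 open Aktiver:NoerresundbyBank:Nemkonto
```

Anything not covered by the table or by normalization is silently removed,
so it is worth checking the output before feeding it to the parser.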


> How hard would it be to add unicode support? And where would I look in the
> source, if I wanted to hack at it?
>

I knew this day would come, but I did not expect it would be so soon.

So here's the context: I've been using flex and bison 3, and the main reason
for that is that I'm really sensitive about dependencies. I feel that using
old tools that are available literally everywhere, together with the C
language, makes it much easier to deal with the gigantic cosmic mess that is
installation and portability. So I've stuck with these old crotchety tools
for a reason. They're not even that easy to use, so that's a bit of a
liability, but I really like the ease of installation they provide, and you
benefit when all it takes is a 2 sec "make build" that just works, so it's
worth it IMO.

Now, I've been unhappy with flex's handling of word boundaries and its
error reporting, so I have recently considered writing my own lexer by
hand. It's not very hard, I just haven't had time. This could be done
with unicode support in C. Also, flex apparently generates 8-bit clean
scanners, so technically it should be possible to make it grok UTF-8, but I
haven't tried it. As for bison, if the lexer spits out wchar tokens or
whatever, I don't really see why it shouldn't be able to handle unicode.
Python can decode encoded C strings through its C API, so once the Python
builder callbacks get invoked, Bob's your uncle and everything else should
work. But it is neither a trivial project nor a small "few lines" patch.
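
To make the byte-oriented idea concrete, here is a rough sketch in Python
rather than flex. The account-name patterns below are assumptions written
for illustration; they are not the rules from lexer.l. The point is only
that a scanner which sees bytes can accept UTF-8 by allowing bytes
0x80-0xFF wherever a letter is allowed, and that decoding can wait until
the token is handed over to Python.

```python
import re

# Assumed account-name rules for illustration only; the real rule lives in
# lexer.l and may differ.
ASCII_ACCOUNT = re.compile(
    rb"[A-Z][A-Za-z0-9-]*(?::[A-Z0-9][A-Za-z0-9-]*)+")

# Same shape, but bytes 0x80-0xFF are accepted wherever a letter is, so
# multi-byte UTF-8 sequences pass through a byte-oriented scanner.
UTF8_ACCOUNT = re.compile(
    rb"[A-Z\x80-\xff][A-Za-z0-9\x80-\xff-]*"
    rb"(?::[A-Z0-9\x80-\xff][A-Za-z0-9\x80-\xff-]*)+")


def lex_account(raw, pattern):
    """Return the decoded account name, or None if raw is not one token."""
    match = pattern.fullmatch(raw)
    if match is None:
        return None
    # The scanner only ever saw bytes; decoding happens at the boundary
    # where the token is handed to Python.
    return match.group(0).decode("utf-8")


line = "Aktiver:NørresundbyBank:Nemkonto".encode("utf-8")
print(lex_account(line, ASCII_ACCOUNT))  # None: the bytes of 'ø' are rejected
print(lex_account(line, UTF8_ACCOUNT))   # Aktiver:NørresundbyBank:Nemkonto
```

This accepts any high byte, so it would also let invalid UTF-8 through to
the decode step, which would then raise an error. A real lexer would have
to decide where to validate the input.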

The code is under
beancount/src/python/beancount/parser/...
look at files:
lexer.l
lexer.py
grammar.y
parser.c
parser.py

I'll consider anything if you want to submit a patch for such a big
change. Overall, my criteria for automatically including a large change in
the parsing tech stack are: (1) C code only, no C++ or exotic languages;
(2) absolute minimal dependencies on third-party packages, all the better
if even the code it depends on is itself written in just C; and (3) it runs
on Linux & Mac OS X, or at least generates code that does. Also, it should
be easy to write a second parser module in _parallel_ with the existing one
(e.g. what gets compiled as beancount/parser/_parser.so), reusing all the
other bits of Python, the builder, etc. We could very easily have two or
more parser implementations in the interest of transition and
experimentation.

Finally, my dev priorities at the moment: finish example & documentation,
then implement all reports to text, a filtering expression syntax,
implement the new inventory booking proposal (to support average cost
booking and cost basis adjustments), and then all the other stuff comes
after. So Unicode is farther down the line unfortunately. I hope you can be
happy with romanization of those characters for a bit.

So in summary: this is not a trivial project. If you can wait, I'll handle
it myself eventually (think: maybe within a year). If you can't wait, you're
welcome to have a go at it and send some code.

Thank you,
