Hello,

as you may be aware from recent messages on the mailing list, Beancount's importers mechanism has been split off from the application core, and there is an ongoing effort to revamp the interfaces. The aim is to simplify them while allowing finer customization of the import mechanism when needed.
The existing importers API uses an interface where caching of file operations and file encoding detection are strongly coupled. I would like to change that, making caching opt-in and removing transparent support for encoding detection.

The rationale is that there should not be any practical advantage in trying to outsmart the OS at file caching, especially for the file sizes and access patterns typical of Beancount importers.

On the other hand, I don't know how many users rely on the transparent encoding detection. Unless I am missing something, text encoding is relevant only for CSV importers or other importers dealing with free-form text, but not for importers dealing with more structured text formats or PDFs. While I have had to cope with CSV files from banks in funny encodings and served with the wrong MIME headers, I have never had an institution change the encoding it uses based on some unpredictable condition. Therefore I assume that determining the encoding of a document is something done once, when the importer is written.

Encoding detection is statistical in nature, so there is no guarantee that two samples in the same encoding are detected in the same way. This may be a source of bugs, so it would be preferable for the encoding used by a given importer to be fixed.

In conclusion, I would like to remove transparent encoding detection from the import framework. Importer authors who want to check the encoding of their input files can use the chardetect script from the chardet Python package.

Would this simplification hurt any real use case?

Thank you.

Cheers,
Dan
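
P.S. To make the fixed-encoding idea concrete, here is a minimal sketch of what a CSV importer could do on its own instead of relying on transparent detection. It uses only the Python standard library. The names `ENCODING` and `read_rows`, the `iso-8859-15` value, and the column names are made up for illustration and are not part of any Beancount API; the encoding is assumed to have been determined once, by hand, from a sample file.

```python
import csv

# Assumption: found once, at importer-writing time (for example by running
# the chardetect script on a sample download), and then hard-coded here.
ENCODING = "iso-8859-15"


def read_rows(path, encoding=ENCODING):
    """Return the rows of a CSV file, decoded with a fixed encoding.

    Hypothetical helper: no caching and no encoding guessing. Bytes that
    cannot be decoded with the given encoding raise UnicodeDecodeError
    instead of being silently reinterpreted.
    """
    # newline="" is what the csv module documentation recommends when
    # passing a file object to a reader.
    with open(path, encoding=encoding, errors="strict", newline="") as f:
        return list(csv.DictReader(f))
```

An importer would then call something like `read_rows(filepath)` from whatever method extracts the entries, and the decoding behaviour stays the same from one run to the next.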
