Encoding detection in importers

Daniele Nicolodi Mon, 15 Feb 2021 08:00:51 -0800

Hello,

as you may be aware from recent messages on the mailing list,
Beancount's importer mechanism has been split off from the application
core, and there is an ongoing effort to revamp the interfaces with the
aim of simplifying them while allowing finer customization of the
import mechanism when needed.

The existing importers API uses an interface in which the caching of
file operations and file encoding detection are strongly coupled. I would
like to change that, making caching opt-in and removing transparent
support for encoding detection.

The rationale for this is that there should not be any practical
advantage in trying to outsmart the OS in file caching, especially for
the file sizes and access patterns typical of Beancount importers. On
the other hand, I don't know how many users rely on the transparent
encoding detection.

Unless I am missing something, text encoding is relevant only for CSV
importers or other importers dealing with free-form text, but not for
importers dealing with more structured text formats or PDFs. While I
have had to cope with CSV files from banks in funny encodings and served
with the wrong MIME headers, I have never seen an institution change the
encoding it uses based on some unpredictable condition. Therefore I
assume that determining the encoding of a document is a one-time task,
done when the importer is written.

Encoding detection is statistical in nature, thus there is no guarantee
that two samples in the same encoding are detected in the same way. This
may be a source of bugs, so it would be preferable for the encoding used
by a given importer to be fixed.
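
To illustrate what a fixed encoding could look like in practice, here is
a minimal sketch using only the Python standard library. The names
ENCODING and read_rows are hypothetical and not part of the Beancount
importers API, and "iso-8859-1" is only a placeholder for whatever
encoding a given institution actually uses.

```python
import csv
import io

# Hypothetical value: the encoding is determined once, when the importer
# is written, and then hard-coded instead of being detected at run time.
ENCODING = "iso-8859-1"


def read_rows(data: bytes, encoding: str = ENCODING) -> list:
    """Decode raw file bytes with a fixed encoding and parse them as CSV."""
    text = data.decode(encoding)
    # newline="" leaves line endings untouched, as the csv module expects.
    return list(csv.reader(io.StringIO(text, newline="")))
```

With this approach the same input bytes always decode to the same text,
regardless of what a statistical detector would have guessed.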

In conclusion, I would like to remove transparent encoding detection
from the import framework. Importer authors who want to validate the
encoding of their input files can use the chardetect script from the
chardet Python package.
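
As a complement to running chardetect by hand, an importer could also
reject files that do not decode cleanly in the encoding it expects. The
following is only a sketch under that assumption: check_encoding is a
hypothetical helper, not an existing Beancount or chardet function. Note
that it can only rule an encoding out. Single-byte encodings such as
ISO-8859-1 accept any byte sequence, so a successful decode does not
prove the guess is right.

```python
def check_encoding(data: bytes, expected: str) -> bool:
    """Return True if data decodes without errors in the expected encoding."""
    try:
        data.decode(expected, errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, bytes written as ISO-8859-1 that contain accented
characters will typically fail a strict UTF-8 decode, which flags a file
that does not match what the importer was written for.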

Would this simplification hurt any real use case?

Thank you.

Cheers,
Dan
