Ah, that makes sense, thank you! Any recommendations on which algorithm works well?

On Tue, Dec 21, 2021 at 5:29 AM Daniele Nicolodi <[email protected]> wrote:

> On 21/12/2021 00:55, Aaron Stacy wrote:
> > Hi, I'm looking for suggestions for categorizing spending (not so much
> > things like paycheck, brokerage transactions, etc, but stuff like credit
> > card spending for budgeting). My ledger has around 2800 transactions
> > over about 2 years, so it's not a ton of data, but it seems like enough
> > that I could leverage something smarter than just string matching
> > the transaction narrations.
> >
> > Does anyone have recommendations for categorizing spending?
> >
> > I'm thinking of applying a full text search index as follows:
> >
> > - Each expense account is a "document".
> > - The document contents is the narration of every transaction for that
> >   account.
> > - To categorize a new transaction, use an engine like Lucene
> >   <https://lucene.apache.org> or sklearn.TfidfVectorizer
> >   and pick the most likely account.
> >
> > Any thoughts on this approach? (aside from being over-engineered. I'm an
> > engineer, IDK what to tell you it's what I do)
>
> I use Beancount and to assign accounts to transactions I use a machine
> learning classifier trained on my existing ledger implemented using
> sklearn.
>
> This works reasonably well for recurring transactions but is not
> infallible. I found that putting a threshold on the confidence score
> from the classifier is essential for not ending up with completely bogus
> account assignments.
>
> Cheers,
> Dan
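
For anyone reading along, here is a rough sketch of what the setup described in the quoted message could look like: a classifier trained on existing narrations, with a confidence threshold below which no account is assigned. The thread does not say which classifier or features are used, so the TF-IDF plus logistic regression pipeline here is only an assumption. The account names, sample narrations, the `train` and `suggest_account` helpers, and the 0.6 threshold are all made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: narrations already categorized in the ledger.
NARRATIONS = [
    "WHOLE FOODS MARKET 123", "TRADER JOES 552", "WHOLE FOODS MARKET 456",
    "SHELL OIL 5744", "CHEVRON 0091", "SHELL OIL 1123",
    "NETFLIX.COM", "SPOTIFY USA", "NETFLIX.COM",
]
ACCOUNTS = (
    ["Expenses:Groceries"] * 3
    + ["Expenses:Auto:Gas"] * 3
    + ["Expenses:Subscriptions"] * 3
)


def train(narrations, accounts):
    """Fit a TF-IDF + logistic regression pipeline (an assumed choice)."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(narrations, accounts)
    return model


def suggest_account(model, narration, threshold=0.6):
    """Return (account, confidence); account is None below the threshold."""
    probabilities = model.predict_proba([narration])[0]
    best = probabilities.argmax()
    confidence = float(probabilities[best])
    if confidence < threshold:
        # Not confident enough: leave the transaction for manual review.
        return None, confidence
    return model.classes_[best], confidence


if __name__ == "__main__":
    model = train(NARRATIONS, ACCOUNTS)
    for text in ["WHOLE FOODS MARKET 789", "UNKNOWN MERCHANT 42"]:
        print(text, "->", suggest_account(model, text))
```

With a real ledger, the threshold would need tuning against held-out transactions. On a toy dataset this small, the probabilities stay low, so most suggestions fall below 0.6 and are left for manual review.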
