Re: [GNC-dev] Fwd: Is the import match map still required?

2020-05-24 Thread David Cousens


1. Not much we can do about that. Presumably that is why the import match
editor was created, so that fat-fingered tokens can be located and deleted. I
still have mixed feelings about pruning the table. It is fine in the case
where you have a known wrong entry as above, but I am less sure that taking
connector tokens out will not adversely affect the ability to score higher on
a phrase that is used consistently in a description/memo field, for example.

2. I suspect doing this on the fly would create too much of a performance
hit, as many people have large files with thousands of transactions (GnuCash
does not require new file creation annually). I would build a procedure that
can be run on an account whenever desired to recreate the frequency table
data from the existing transactions' transfer accounts and replace the
existing data. Users would need to select the account to run it for, and a
date range from which to use transactions to construct the table, for the
cases where five years ago someone used a different account structure. It
should not be too hard, as the processes for tokenizing transactions already
exist in the matcher code. If it can be run standalone, then it can be
tested to see what effect it would have if it were run on the fly during
import.
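A rough sketch of such a rebuild procedure, in Python for brevity (the
transaction fields, whitespace tokenizing, and function name here are
hypothetical stand-ins for illustration, not the actual GnuCash matcher
code, which is in C):

```python
from datetime import date
from collections import defaultdict

def rebuild_freq_table(transactions, start, end):
    """Rebuild a token -> {transfer account: count} frequency table
    from an account's existing transactions within a date range."""
    table = defaultdict(lambda: defaultdict(int))
    for txn in transactions:
        # honour the user-selected date range, so pre-restructure
        # history can be excluded
        if not (start <= txn["date"] <= end):
            continue
        # tokenize the description the way the importer would
        for token in txn["description"].lower().split():
            table[token][txn["transfer_account"]] += 1
    return table

txns = [
    {"date": date(2020, 3, 1), "description": "Gym membership",
     "transfer_account": "Expenses:Fitness"},
    {"date": date(2015, 1, 1), "description": "Old structure",
     "transfer_account": "Expenses:Legacy"},
]
table = rebuild_freq_table(txns, date(2019, 1, 1), date(2020, 12, 31))
# only tokens from transactions inside the range appear in the table
```

Run standalone against a test file, this would also give a feel for the
cost of doing the same work on the fly during an import.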

3. Not sure about that. I think it is likely that only the transaction data
is moved to the new account, but I am not certain. The data may all be read
into memory initially, so it shouldn't be too hard to write it to a merged
account. On the other hand, if a standalone procedure as in 2 is created,
all that is needed is to execute it after the merge and Bob's your uncle.

David



-
David Cousens
--
Sent from: http://gnucash.1415818.n4.nabble.com/GnuCash-Dev-f1435356.html
___
gnucash-devel mailing list
gnucash-devel@gnucash.org
https://lists.gnucash.org/mailman/listinfo/gnucash-devel


Re: [GNC-dev] Import Map Editor - maintain position in list

2020-05-24 Thread David Cousens
Chris 

I think that is a good move. Having to continually re-find my position in
the list after deleting tokens was a real frustration while pruning the
token list manually. I am not working in that area.

David





[GNC-dev] Import Map Editor - maintain position in list

2020-05-24 Thread Chris Good
Hi,

 

I'd like to modify the Import Map Editor so that after deleting the selected
entries, you do not get reset back to the start of the list.

Is anybody already working on that or some other major change to the Import
Map Editor?

 

Regards,

Chris Good

 



Re: [GNC-dev] Is the import match map still required?

2020-05-24 Thread Chris Good
On Sun, 24 May 2020 at 15:44 (+1000), flywire wrote to gnucash-devel:

The most obvious match would be to map any transfer accounts in the data
to GnuCash accounts, even if the result needs to be verified.

Other comments:
1) A user's rapid clicks can unintentionally select the wrong account,
mapping invalid data.
2) There seems to be an opportunity for the user to re-run a process to
recreate the map and prune the useless matches David refers to (dates,
connectors (a, and, the, etc.), transaction amounts?). With enough
transactions this should be pretty good.
3) I assume the table is updated when accounts are merged.

...
4) Assuming the match is case sensitive, should it optionally be turned off?


Hi flywire,

Re matching input file accounts to GnuCash accounts:
I guess this would only apply to QIF or CSV imports, but it sounds like a
good idea.

1) & 2) You can always run the Import Map Editor to delete bad matches.
I've always thought it would be a good idea if there were a parameter for
the minimum token length the Bayesian matching would consider, so I could
get it to ignore silly data that in no way helps a correct match, like the
date separators "/" and "-", or dd or mm or yy or .
It would also be useful, but a fair bit of work, to have a screen where you
could enter string tokens to be ignored, like "Receipt", "September", etc.
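As an illustration of what such a minimum-length parameter and user ignore
list might look like together (a sketch with made-up names and thresholds,
not a proposed GnuCash API):

```python
import re

IGNORE = {"receipt", "september"}  # example user-entered ignore list
MIN_LEN = 3                        # hypothetical minimum token length

def useful_tokens(text, min_len=MIN_LEN, ignore=IGNORE):
    """Drop tokens that carry no matching signal: too short
    (date separators, day/month numbers), on the ignore list,
    or consisting only of digits and date punctuation."""
    return [t for t in text.lower().split()
            if len(t) >= min_len
            and t not in ignore
            and not re.fullmatch(r"[\d/.-]+", t)]

useful_tokens("Receipt 12/09 ACME Gym September")
# keeps only "acme" and "gym"
```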

3) What does "I assume the table is updated with merged accounts" mean?
If you mean when you delete an account and elect to move all the
transactions to another account, I've got no idea, but it would be easy
enough to test. I am not sure it would be worth the effort, as it is easy
enough to build up mapping history again. The problem is making the mapping
history useful.

4) Possibly useful.

Regards, Chris Good



Re: [GNC-dev] Is the import match map still required?

2020-05-24 Thread David Cousens
Christian,

I haven't experimented to know whether constructing the frequency table on
the fly creates a performance bottleneck, but I am guessing the original
developer thought it might. It would require a detailed look at the code
involved, but my suspicion is that the performance penalty is likely to be
significant.

My comment about bloat is that at present data is only maintained for
accounts you specifically import data into, and only if that data is
stored. If it isn't stored, then bloat obviously doesn't apply. Any sort of
generalized procedure could allow selection of the accounts for which
Bayesian matching is required, i.e. those for which importing is used to
input data. My initial thought was that you would run it for all accounts,
but it is really only necessary for the specific subset of accounts into
which you import data. It would then require the ability to run the
procedure on an account if it occurred in import data but didn't have
existing account-matching data. If it is on the fly, then there is no
problem: it can run whenever a new account being imported into appears in
the imported data. The most common use case is probably importing data to
one specific account, but GnuCash can also specify the account being
imported into in the import data itself. I haven't looked at how the
frequency table is currently stored in memory, but I am guessing it is
constructed in memory when the data file is read in.

The up-to-date aspect is one advantage, and if the current procedure is
changed to improve performance, that is not hampered by the presence of
historical data, which would be updated automatically when the procedure is
run. If the table is stored as it is at present, and a procedure were
available to trawl the current transactions for an account, then it could
be kept up to date by running that procedure periodically. The data from
manually entered transactions would then be incorporated, whether on the
fly or just run as required.

Having a standalone procedure to trawl an existing file and update the
stored data for an account would allow exploration of whether this is
likely to be a significant performance hit if run on the fly, so that could
perhaps be a first step. The core part of the code to store the data has to
exist in the matcher code already, and it would be a case of wrapping it in
a loop through the transactions existing in an account and setting up the
GUI interface to select the accounts to run on.

The problem with pruning the data is that GnuCash has no way of knowing
a priori which tokens are most relevant. I would think that date
information is not really relevant, and amount/value information does
little in most cases to identify a transfer account.

The main difficulty I have with transfer account assignment is that some
regular transactions use a unique code in the description each time they
occur, with no separate unique identifier of the transaction source. My
wife and I both have separate gym membership subscriptions, and the
transaction descriptions identify neither the gym nor which of us the
transaction applies to. The options are to persuade the source to include
specific data, or to use only a single account to record both, but I like
to track both our individual and joint expenses.

Some regular transactions also get matched to previous payments during
transaction matching within the date-range window, where the amounts and
descriptions are usually identical. The current 42-day window captures both
fortnightly and monthly regular income transactions, for example. This only
affects a few transactions each month, and I don't have huge numbers of
transactions to process now that I have retired, but that may not be the
case for other users. Making the date-range window adjustable rather than
fixed might be a cure for this. Setting it at less than 14 days would cure
the problems I have, for example, but that again would not work for
everybody.
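A simplified sketch of how an adjustable window would change this duplicate
matching (amount and date only; the real matcher scores more fields than
this, and the names here are illustrative):

```python
from datetime import date, timedelta

MATCH_WINDOW_DAYS = 42  # the current fixed window, made a parameter

def find_duplicates(imported, existing, window_days=MATCH_WINDOW_DAYS):
    """Return existing transactions with the same amount whose date
    falls within +/- window_days of the imported transaction."""
    window = timedelta(days=window_days)
    return [e for e in existing
            if e["amount"] == imported["amount"]
            and abs(e["date"] - imported["date"]) <= window]

# a fortnightly payment: the previous instalment looks identical
fortnightly = [{"date": date(2020, 5, 10), "amount": 50},
               {"date": date(2020, 5, 24), "amount": 50}]
new = {"date": date(2020, 5, 24), "amount": 50}

len(find_duplicates(new, fortnightly, 42))  # 2: prior payment also matches
len(find_duplicates(new, fortnightly, 13))  # 1: only the true duplicate
```

With the window below the payment interval, only the genuine duplicate is
flagged, which is the cure described above.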

I am currently committed to a bit of work on the documentation front, so I
will be unlikely to consider this in the near future in other than general
terms, but someone else may be willing to take it up.

David





Re: [GNC-dev] Is the import match map still required?

2020-05-24 Thread Christian Gruber



On 24.05.20 at 01:52, David Cousens wrote:

Christian,

I guess it depends on whether there is a performance advantage in using the
previously stored data for the transfer account associations over
constructing the frequency table on the fly. The search for matching
transactions only takes place within a narrow time window around the date
of import, so it is unlikely to canvass enough transactions to construct a
valid frequency table from tokenized data within that window. The stored
frequency table would generally contain data from a much wider range of
transactions, and would take much longer to construct on the fly each time
it was needed.
I'm only thinking about account matching (Bayesian matching), not
transaction matching. For this, of course, it would be necessary to work
with all historical data, not only with a few transactions within a narrow
time window. Can you tell whether it would be a considerable performance
load to construct the frequency table on the fly from all historical
transactions related to a transfer account?
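For concreteness, this is roughly what the account matching computes from
the frequency table, shown as a simplified naive-Bayes-style sketch in
Python (an illustration of the idea, not GnuCash's actual algorithm or
smoothing):

```python
import math

def score_accounts(tokens, table):
    """Pick the transfer account whose historical token counts best
    explain the imported tokens, via summed log-probabilities."""
    accounts = {a for counts in table.values() for a in counts}
    scores = {}
    for acct in accounts:
        score = 0.0
        for tok in tokens:
            counts = table.get(tok, {})
            total = sum(counts.values())
            # add-one smoothing so an unseen token doesn't zero a score
            score += math.log((counts.get(acct, 0) + 1)
                              / (total + len(accounts)))
        scores[acct] = score
    return max(scores, key=scores.get) if scores else None

table = {"gym":  {"Expenses:Fitness": 8, "Expenses:Groceries": 1},
         "acme": {"Expenses:Fitness": 6}}
score_accounts(["gym", "acme"], table)  # 'Expenses:Fitness'
```

Whether stored or built on the fly, the expensive part is producing
`table` from all historical transactions; the scoring itself is cheap.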

I have also pondered whether it could be usefully augmented by using data
from manually entered transactions which have not been imported for the
file associations. It could be of value where you have a good set of
historical records, and it would only need a one-off run through the
existing transactions to gather the data. Unless you confined it to running
on a specific set of accounts to which you import data, it might cause
bloat of the data file with unnecessary and unused information.


A possible advantage of constructing the frequency table on the fly could
be that it is always up to date. If the user sets the "wrong" other account
during import, for instance, and corrects this after the import, the import
match map still contains the wrong matching information at the moment, and
it will also not be corrected after the import.


Manually entered transactions would also be considered, right.

A one-off manual run through all transactions to update the import match 
map could be a good alternative to constructing it on the fly. Sounds good.


Why do you think a run through all transactions "might cause a bloat of
the data file"? The current import match map also contains all the maybe
unused or unnecessary data from all matched accounts. I still assume in
this case that the import match map is related to one transfer account
only, which already limits the set of accounts from which the import
match map is constructed.



I have examined the stored data in my data file with the import map editor
and found that a lot of stored data contributes little to the matching for
the transfer account (dates, connectors (a, and, the, etc.), transaction
amounts?), as it often has a fairly uniform frequency across all accounts
which were used as transfer accounts. After a bit of pruning of the stored
data, my matching reliability seemed to improve a bit.
Ok, I see. If the import match map has to be pruned to get reliable
results from the Bayesian matching algorithm, a frequency table which is
constructed on the fly or rebuilt in a one-off run is a big disadvantage.
If it is constructed on the fly, nothing can be pruned. And if it is
rebuilt, all the pruned data will be back after the run.

I don't know at the moment if the tokens stored for transfer account
matching are a subset of the tokens used for transaction matching (I
haven't checked), but restricting the set of tokens used may possibly
improve performance and reduce the amount of data stored, if all tokens
associated with a transaction are currently being stored in the frequency
table, which is what I suspect from examining my import map data.
Yes, this is the current situation: every token is stored. Do you have
suggestions for how tokens could be automatically pruned in a meaningful
way?
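One automatic criterion might be dominance: keep a token only if a single
transfer account accounts for most of its occurrences, since connectors
and dates tend to spread evenly across accounts and so fall below any such
threshold. A sketch of the idea (the 0.7 threshold and names are arbitrary
illustrations, not a worked-out proposal):

```python
def prune_uninformative(table, min_share=0.7):
    """Keep only tokens where one transfer account dominates the
    counts; near-uniform tokens carry little matching signal."""
    pruned = {}
    for token, counts in table.items():
        total = sum(counts.values())
        if total and max(counts.values()) / total >= min_share:
            pruned[token] = counts
    return pruned

table = {
    "gym": {"Expenses:Fitness": 9, "Expenses:Groceries": 1},
    "and": {"Expenses:Fitness": 5, "Expenses:Groceries": 5},
}
prune_uninformative(table)
# keeps "gym" (0.9 dominance), drops "and" (0.5)
```

Because this is computed from the counts themselves, it could be applied
after an on-the-fly or one-off rebuild, which would address the concern
above that rebuilding undoes manual pruning.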


David Cousens





Re: [GNC-dev] Is the import match map still required?

2020-05-24 Thread flywire
4) Assuming the match is case sensitive, should it optionally be turned off?