Hi,
I don’t really see how this could be mapped onto a graph problem. I would
first profile the original Python code to see whether there is an obvious
bottleneck somewhere before trying anything more exotic.
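For the profiling step, the standard cProfile module is usually enough to show
where the time goes. A minimal sketch follows; the correlate() function and the
two tiny datasets are just stand-ins for your own code, not anything you have
posted:

import cProfile

def correlate(s1, s2):
    # Stand-in for your own all-pairs correlation routine.
    return [(a, b) for a in s1 for b in s2 if set(a) & set(b)]

s1 = [("apple", 5, 150), ("orange", 8, 200)]
s2 = [("apple", 5, 150), ("orange", 3, 150)]

# Run the routine under the profiler and sort by cumulative time.
cProfile.run("correlate(s1, s2)", sort="cumulative")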
If your S1 and S2 datasets are fully available before you run the algorithm,
and only a small fraction of the row pairs are expected to have a non-empty
intersection, one trick you can use to speed things up is to build an
“index map” for each of S1 and S2. The index of S1 (and likewise of S2) is a
dictionary that maps each element X appearing anywhere in S1 to the list of
indices of the rows of S1 that contain X. Such an index is easy to build with
a defaultdict(list) and a single pass over the dataset. Once you have the
index maps for both S1 and S2, it is enough to iterate as follows:
initialize an empty score dictionary
for every item X in the index of S1:
    if X is not a key in the index of S2:
        continue
    look up the indices of the rows containing X in S1 from the index of S1
    look up the indices of the rows containing X in S2 from the index of S2
    for each pair formed from the row indices found in the previous two lines:
        if the row pair is already in the score dictionary, continue
        otherwise calculate the score of the row pair and store it in the
            score dictionary
The above idea ensures that disjoint row pairs are never compared explicitly.
The price is that you have to build the indices first (which costs time and
space), and you also have to keep track of the row pairs you have already
scored (which again costs time and space) so that you do not compare the same
pair of rows twice when they share more than one item.
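To make the idea concrete, here is a rough, self-contained sketch in Python.
It only illustrates the approach above, not a drop-in solution: the score()
function assumes the score is the number of shared elements divided by the row
length, which is what the 100% / 66% / 33% figures in your example suggest,
and the function and variable names are mine:

from collections import defaultdict

def build_index(rows):
    # Map each element to the indices of the rows that contain it.
    index = defaultdict(list)
    for i, row in enumerate(rows):
        for item in set(row):
            index[item].append(i)
    return index

def score(row1, row2):
    # Assumed scoring: shared elements divided by the row length,
    # matching the 100% / 66% / 33% figures in the example below.
    return len(set(row1) & set(row2)) / float(len(row1))

def pairwise_scores(s1, s2):
    index1 = build_index(s1)
    index2 = build_index(s2)
    scores = {}  # (row index in S1, row index in S2) -> score
    for item, rows1 in index1.items():
        if item not in index2:
            continue  # element never appears in S2, nothing to compare
        for i in rows1:
            for j in index2[item]:
                if (i, j) not in scores:
                    scores[(i, j)] = score(s1[i], s2[j])
    return scores

s1 = [("apple", 5, 150), ("apple", 3, 150),
      ("orange", 8, 200), ("orange", 5, 150)]
s2 = [("apple", 8, 200), ("apple", 5, 150),
      ("orange", 8, 100), ("orange", 3, 150)]
for pair, value in sorted(pairwise_scores(s1, s2).items()):
    print(pair, value)

Note that any element shared between the two datasets (including the numbers)
pulls its rows together as candidates; if only some columns should drive the
candidate generation, index on those columns alone.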
--
T.
On Sunday, 10 November 2013 at 13:06, Marc wrote:
> Hi All,
>
> I've been trying to write some logic in Python to correlate two sets of
> simple data. For example:
>
> 1) set 1 (S1)
> apple, 5, 150
> apple, 3, 150
> orange, 8, 200
> orange, 5, 150
>
> 2) set 2 (S2)
> apple, 8, 200
> apple, 5, 150
> orange, 8, 100
> orange, 3, 150
>
>
> In the above, the following rows should match up:
> S1 row 1 = S2 row 2 (100% match)
> S1 row 2 = S2 row 4 (66% match)
> S1 row 3 = S2 row 1 (66% match)
> S1 row 4 = S2 row 3 (33% match)
>
>
> I've written logic to do this, but it scales badly on larger sets of data due
> to the number of comparisons I need to make.
> I confess I'm not familiar with Graph Theory, but I get the general feeling
> it may be possible to emulate this type of correlation functionality.
>
> I'd be very grateful if somebody could give me some initial feedback /
> overview if this would be possible - and could igraph help?
>
> Thanks for any feedback. All the best,
> Marc
>
_______________________________________________
igraph-help mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/igraph-help