Re: [Jalview-dev] slow loading of 300000+ seq fasta file

Jim Procter Wed, 31 May 2017 12:38:30 -0700

Well hunted, Kira.

On 31/05/2017 16:17, Kira Mourao (Staff) wrote:
> I’ve just found the reason the file I was looking at was not loading (or
> actually is loading but extremely slowly). The bad news is it looks like
> it’s been in the code since Aug 2016 but the good news is it looks very
> fixable.
:)


> The initialisation is being held up in
> Alignment::resolveAndAddDatasetSeq which is in the call stack called by
> the AlignFrame initialisation code.
This was added to avoid duplicate sequence import when opening Ensembl
or ENA CDS, if I remember correctly (though Mungo may have a better story).

> The reason seqs.contains is slow is because, despite the name,
> LinkedIdentityHashSet::contains is doing a linear search. This rather
> echoes what I was saying earlier about checking our data structures are
> appropriate.
natch.

> I’ll log a JIRA issue for this. It would be useful to know what the
> purpose of using LinkedIdentityHashSet here was though, as this is the
> only place it’s used in the code.

The use of IdentityHash was to spot duplicates based on the Object
reference (i.e. equivalence based on == rather than .equals()). However,
I'd have hoped contains() would not simply do a linear search. ISTR a
LinkedHashSet was chosen for order preservation, which made life easier
for the CDS/Splitframe logic.
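
For illustration only, here is a minimal sketch of how the two requirements above (identity-based duplicate detection plus insertion order) could be met with a constant-time contains(). It is not the Jalview implementation: the class name OrderedIdentitySet is hypothetical, and it relies only on the standard java.util classes IdentityHashMap and Collections.newSetFromMap.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not Jalview code): an identity-based set that keeps insertion order. */
final class OrderedIdentitySet<T> implements Iterable<T> {
    // Membership is decided by reference (==) via a hash table,
    // so contains() is O(1) on average instead of a linear scan.
    private final Set<T> members = Collections.newSetFromMap(new IdentityHashMap<>());
    // A separate list remembers the order in which elements were first added.
    private final List<T> order = new ArrayList<>();

    /** Adds the item unless this exact object is already present. */
    boolean add(T item) {
        if (!members.add(item)) {
            return false; // same reference already stored
        }
        order.add(item);
        return true;
    }

    boolean contains(Object item) {
        return members.contains(item);
    }

    int size() {
        return order.size();
    }

    /** Iterates in insertion order; the returned iterator is read-only. */
    @Override
    public Iterator<T> iterator() {
        return Collections.unmodifiableList(order).iterator();
    }

    public static void main(String[] args) {
        OrderedIdentitySet<String> seqs = new OrderedIdentitySet<>();
        String a = new String("ACGT");
        String b = new String("ACGT"); // equal to a, but a different object
        seqs.add(a);
        System.out.println(seqs.contains(a)); // true
        System.out.println(seqs.contains(b)); // false
    }
}
```

The sketch leaves out removal; supporting it would make the list update linear, so a real replacement would need to weigh that against how the CDS/Splitframe code actually uses the set.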

Some relevant issues: JAL-2132, which may have been the original reason
for this bit of logic back in 2016. That issue is overshadowed by the
real requirement: full normalisation (JAL-407).

I was idly googling IdentityHashMap to see if there are any workarounds.
We could simply enforce primary keys and hash on those
(SequenceI.getVamsasId() would fit that), but I also found this library
https://bitbucket.org/trove4j/trove
via http://java-performance.info/java-util-identityhashmap/.
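
As a rough sketch of the "enforce primary keys and hash on those" option: the Seq class and its getId() method below are hypothetical stand-ins for a sequence object and something like SequenceI.getVamsasId(), and the sketch assumes every sequence carries a unique, stable id. A LinkedHashMap keyed on that id gives hashed lookup and keeps insertion order.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not Jalview code): duplicate detection keyed on a primary id. */
final class KeyedSequenceIndex {

    /** Stand-in for a sequence object; getId() plays the role of a primary key. */
    static final class Seq {
        private final String id;
        private final String residues;

        Seq(String id, String residues) {
            this.id = id;
            this.residues = residues;
        }

        String getId() {
            return id;
        }

        String getResidues() {
            return residues;
        }
    }

    // LinkedHashMap: hashed lookup on the id, iteration in insertion order.
    private final Map<String, Seq> byId = new LinkedHashMap<>();

    /** Returns true if the sequence was added, false if its id was already present. */
    boolean addIfAbsent(Seq seq) {
        return byId.putIfAbsent(seq.getId(), seq) == null;
    }

    boolean contains(Seq seq) {
        return byId.containsKey(seq.getId());
    }

    /** Sequences in the order they were first added. */
    Collection<Seq> inOrder() {
        return byId.values();
    }
}
```

Unlike the identity-based approach, this treats two distinct objects that share an id as the same sequence, so it only behaves equivalently if ids really are unique per object.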

..Jim.

