On Thu, Nov 21, 2013 at 6:11 PM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> On Thu, Nov 21, 2013 at 5:59 PM, Dooley, Damion <damion.doo...@bccdc.ca> wrote:
>> I hear you, re. guessing about data - it just sounded like this would be a
>> rare case.  Is it happening on particular database searches?  Now that I
>> look at it I'm wondering in what situation the IndexError would be triggered.
>> I'm diving into the details here just because I don't want to discover later
>> on that I'd made some assumptions about the id parsing.
>
> Yes, it is rare - but the fix was triggered by falling over the following
> example from a BLAST against the NR database, shown in the commit
> comment:
>
> https://github.com/peterjc/galaxy_blast/commit/5210af6622bf905ecb09ffbf6d7d348cfe015dc3
>
>         <Hit>
>           <Hit_num>146</Hit_num>
>           <Hit_id>gi|157832142|pdb|1NKD|A</Hit_id>
>           <Hit_def>Chain A, Atomic Resolution (1.07 Angstroms)
> Structure Of The Rop Mutant &lt;2aa&gt; &gt;gi|157833740|pdb|1RPO|A
> Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding
> Of A 4- Alpha-Helical Bundle</Hit_def>
>           <Hit_accession>1NKD_A</Hit_accession>
>           <Hit_len>65</Hit_len>
>
> Splitting on just the greater-than sign broke on the <2aa> comment. Splitting
> on a space followed by a greater-than sign is slightly less fragile.
>
> Ideally this multi-entry field would be presented explicitly in the XML,
> something I suggested in passing on this related blog post:
> http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
>
> You can see the problem entry like this:
>
> $ blastdbcmd -entry 157832142 -db nr -outfmt "%t"
> Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop Mutant <2aa>
> Chain A, Restored Heptad Pattern Continuity Does Not Alter The Folding
> Of A 4- Alpha-Helical Bundle
>
> To see if there are any more naughty entries in the NR database, I am trying
> this command (no output yet, might take a while though):
>
> $ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">"
> ...
>

With hindsight, I should have asked for the protein ID to be included too,
but in any case there are about 2650 examples. Most use an arrow
like --> or ->, but quite a few contain HTML italics tags, plus things like
<2aa>, <R> and <ESP>:

$ time blastdbcmd -entry all -db nr -outfmt "%t" | grep ">"
Alpha (1->4) glucosyltransferase [Mycobacterium tuberculosis H37Rv]
Alpha (1->4) glucosyltransferase [Mycobacterium tuberculosis H37Rv]
ssDNA exonuclease, 5' --> 3'-specific [Escherichia coli str. K-12
substr. MG1655]
ssDNA exonuclease, 5' --> 3'-specific [Escherichia coli str. K-12 substr. W3110]
...
PREDICTED: putative C->U-editing enzyme APOBEC-4-like, partial
[Alligator sinensis]

real    1842m2.024s
user    1835m55.169s
sys    15m17.406s

(i.e. about 30 hours to dump all the titles from the NR database).
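
For anyone wanting to repeat the breakdown above without eyeballing the grep
output, here is a rough Python sketch that buckets titles by how the ">"
appears. The function names (classify, tally) and the category labels are
made up for illustration, and the tag pattern is only a guess at what counts
as a "tag". It also assumes one complete title per line, which is not true of
the line-wrapped output pasted in this email.

```python
import re
from collections import Counter

# Anything like <i>, </i>, <R>, <ESP> or <2aa>: angle brackets with no
# whitespace inside. This is a loose heuristic, not an HTML parser.
TAG = re.compile(r"<[^<>\s]+>")


def classify(title):
    """Return a rough label for how '>' is used in one title."""
    if ">" not in title:
        return "none"
    if " >" in title:
        # The troublesome case for splitting merged descriptions.
        return "space-gt"
    if TAG.search(title):
        return "tag"
    if "->" in title:
        # Covers both -> and -->
        return "arrow"
    return "other"


def tally(titles):
    """Count titles per category, e.g. tally(open('titles.txt'))."""
    return Counter(classify(t.rstrip("\n")) for t in titles)
```

The "space-gt" check comes first on purpose, since a title can contain both
an arrow and a " >" and it is the latter that matters for the splitting.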

The bad news (in terms of splitting the XML description) is that there are
currently 30 examples with " >" (a space followed by a greater-than sign) in
the description. Many look purely accidental, I would guess due to past line
wrapping:

Chain B, Crystal Structure Of Trypsin Complexed With The Bpti Variant
(Tyr35- >gly)
Chain A, Structural Characterization Of Heme Ligation In The His64--
>tyr Variant Of Myoglobin
Chain L, Fv Fragment Of Mouse Monoclonal Antibody D1.3 (BalbC, IGG1,
K) Engineered Mutant Pro95l->ser On Variant Chain L Glu81- >asp And
Chain H Leu312->val
Chain A, Alteration Of Axial Coordination By Protein Engineering In
Myoglobin. Bis-Imidazole Ligation In The His64--
>val(Slash)val68-->his Double Mutant
Chain B, Alteration Of Axial Coordination By Protein Engineering In
Myoglobin. Bis-Imidazole Ligation In The His64--
>val(Slash)val68-->his Double Mutant
Very similar to alpha-NACs, (Nascent polypeptide > [Arabidopsis thaliana]
Chain A, Golgi Mannosidase Ii D204a Catalytic Nucleophile Mutant
Complex With 
Methyl(Alpha-D-Mannopyranosyl)-(1->3)-S-[(Alpha-D-Mannopyranosyl)-(1-
>6)]-Alpha-D-Mannopyranoside
putative NADH dehydrogenase Fe-S protein 7 >Feature gb|DQ213771
[Taeniopygia guttata]
Chain A, Crystal Structure Of Mouse Aurora A (asn186->gly,
Lys240->arg, Met302- >leu) In Complex With
1-{5-[2-(thieno[3,2-d]pyrimidin-4-ylamino)- Ethyl]-
Thiazol-2-yl}-3-(3-trifluoromethyl-phenyl)-urea
Chain A, Crystal Structure Of Mouse Aurora A (asn186->gly,
Lys240->arg, Met302- >leu) In Complex With
1-(3-chloro-phenyl)-3-{5-[2-(thieno[3,2- D]pyrimidin-4-ylamino)-
Ethyl]-thiazol-2-yl}-urea [sns-314]
Chain A, Crystal Structure Of Mouse Aurora A (Asn186->gly, Lys240-
>arg, Met302->leu) In Complex With 1-{5-[2-(1-Methyl-1h-
Pyrazolo[4,3-D]pyrimidin-7-Ylamino)-Ethyl]-Thiazol-2-Yl}-3-
(3-Trifluoromethyl-Phenyl)-Urea
Chain A, Crystal Structure Of Mouse Aurora A (Asn186->gly, Lys240-
>arg, Met302->leu) In Complex With [7-(2-{2-[3-(3-Chloro-
Phenyl)-Ureido]-Thiazol-5-Yl}-Ethylamino)-Pyrazolo[4,3-
D]pyrimidin-1-Yl]-Acetic Acid
Glutaminase > [Sulfurimonas gotlandica GD1]
exonuclease VIII 5 > 3 specific dsDNA exonuclease [uncultured phage
MedDCM-OCT-S12-C102]
exonuclease VIII 5 > 3 specific dsDNA exonuclease [uncultured organism
MedDCM-OCT-S08-C700]
transporter, probably Low affinity (KM > 3mM) ammonia uptake carrier,
AmtB [Bifidobacterium asteroides PRL2011]
transporter, probably Low affinity (KM > 3mM) ammonia uptake carrier,
AmtB [Bifidobacterium asteroides PRL2011]
Chain H, Fv Fragment Of Mouse Monoclonal Antibody D1.3 (BalbC, IGG1,
K) Engineered Mutant Pro95l->ser On Variant Chain L Glu81- >asp And
Chain H Leu312->val
Chain A, Solution Structure Of The Monomeric [thr(B27)->pro,Pro(B28)-
>thr] Insulin Mutant (Pt Insulin)
Chain A, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2
(variant Gly16- > D-ala)
Chain B, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2
(variant Gly16- > D-ala)
Chain C, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2
(variant Gly16- > D-ala)
Chain D, Crystal Structure Of Human Neutrophil Peptide 2, Hnp-2
(variant Gly16- > D-ala)
putative oxidase > apramycin biosynthesisN-methyltransferase
[Streptoalloteichus tenebrarius]
First of four adjacent putative subtilase family > [Arabidopsis thaliana]
Third of four adjacent putative subtilase family > [Arabidopsis thaliana]
putative ORF >60AA, partial [Escherichia coli]
putative ORF >24AA, partial [Escherichia coli]
Chain B, Solution Structure Of The Monomeric [thr(B27)->pro,Pro(B28)-
>thr] Insulin Mutant (Pt Insulin)
Chain A, Crystal Structure Of Trypsin Complexed With The Bpti Variant
(Tyr35- >gly)
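
To make the failure modes concrete, here is a small Python sketch comparing
the two splitting rules discussed in this thread on the original NR example
and on one of the titles listed above. This is not the code from the
galaxy_blast commit, and the function names are hypothetical. The third
function is just one possible tightening (only split where " >" is
immediately followed by something that looks like an NCBI-style "db|"
identifier such as "gi|"); that rule is my assumption for illustration and
has not been checked against the whole of NR.

```python
import re

# Hit_def from the NR example, after XML unescaping (&lt; / &gt; decoded)
# and with the email line wrapping removed.
MERGED = (
    "Chain A, Atomic Resolution (1.07 Angstroms) Structure Of The Rop "
    "Mutant <2aa> >gi|157833740|pdb|1RPO|A Chain A, Restored Heptad "
    "Pattern Continuity Does Not Alter The Folding Of A 4- "
    "Alpha-Helical Bundle"
)

# A single (unmerged) NR title which happens to contain " >".
SINGLE = "Glutaminase > [Sulfurimonas gotlandica GD1]"


def first_title_gt(hit_def):
    """Split on any '>': breaks on the <2aa> comment."""
    return hit_def.split(">", 1)[0].strip()


def first_title_space_gt(hit_def):
    """Split on ' >': copes with <2aa>, but truncates SINGLE."""
    return hit_def.split(" >", 1)[0].strip()


def first_title_strict(hit_def):
    """Hypothetical: split on ' >' only if followed by lowercase 'db|'."""
    return re.split(r" >(?=[a-z]+\|)", hit_def, maxsplit=1)[0].strip()


for func in (first_title_gt, first_title_space_gt, first_title_strict):
    print(func.__name__)
    print("  merged:", func(MERGED))
    print("  single:", func(SINGLE))
```

On these two inputs the plain ">" split cuts the first title off at "<2aa",
the " >" split keeps "<2aa>" but reduces the Glutaminase title to just
"Glutaminase", and the stricter pattern leaves the Glutaminase title intact.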

Regards,

Peter
