Re: [galaxy-dev] datatype dependencies

Peter Cock Thu, 17 Jul 2014 12:12:36 -0700

You could do something like that, and we already have
Biopython packages in the ToolShed which can be listed
as dependencies :)


However, some formats like GenBank are tricky: in order
to tolerate NCBI dumps, the Biopython parser ignores any
free text before the first LOCUS line. A confusing side
effect is that most text files are then treated as a
GenBank file with zero records. But if parsing comes back
with at least one record, the file is probably OK :)
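
A minimal sketch of that check, assuming a hypothetical
helper called looks_like_genbank: the file only counts as
GenBank if the parser yields at least one record. The
`parse` argument is also an assumption, added so the logic
can be exercised without Biopython installed; when it is
omitted the real Bio.SeqIO.parse is used.

```python
def looks_like_genbank(path, parse=None):
    """Return True only if parsing `path` as GenBank yields a record.

    Both the function name and the `parse` hook are hypothetical and
    exist for illustration; the default parser is Bio.SeqIO.parse.
    """
    if parse is None:
        # Imported lazily so Biopython is only required when it is used.
        from Bio import SeqIO
        parse = SeqIO.parse
    try:
        # Only the first record matters, so do not load the whole file.
        first = next(iter(parse(path, "genbank")), None)
    except Exception:
        # Any parser error is treated as "not GenBank".
        return False
    # Zero records means the parser found no LOCUS line at all.
    return first is not None
```

This is still a heuristic: a file with one good record
followed by garbage would pass, so full validation would
need to read every record.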

Basically, Biopython deliberately does not offer file
format detection, simply because it is a can of worms.

Zen of Python - explicit is better than implicit.

We want you to tell us which format you want to try
parsing it as.
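
As a rough illustration of staying explicit, the sketch
below has the caller name the candidate formats, in order,
and reports the first one that yields a record. The names
first_parsable_format and `parse` are hypothetical, not
part of Biopython; the default parser is Bio.SeqIO.parse.

```python
def first_parsable_format(path, candidates, parse=None):
    """Return the first format in `candidates` that yields a record.

    Returns None if no candidate produces a record. The function name
    and the `parse` hook are hypothetical, for illustration only.
    """
    if parse is None:
        # Imported lazily so Biopython is only required when it is used.
        from Bio import SeqIO
        parse = SeqIO.parse
    for fmt in candidates:
        try:
            # A format counts only if at least one record comes back.
            if next(iter(parse(path, fmt)), None) is not None:
                return fmt
        except Exception:
            # A parser error just means "not this format"; try the next.
            continue
    return None
```

The order of the candidates matters, since lenient parsers
may accept files that were meant for a stricter format.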

Sorry,

Peter
(Speaking as the Bio.SeqIO maintainer for Biopython)


On Thu, Jul 17, 2014 at 7:45 PM, Eric Rasche <[email protected]> wrote:
> Let's pretend for a second that I'm rather lazy (oh...wait), and I have
> ZERO interest in writing datatype parsers to sniff and validate whether
> or not a specific file is a specific datatype. I'm a sysadmin and
> bioinformatician, and I've worked with dozens of libraries that exist to
> parse file formats, and they all die in flames when I feed them bad data.
>
> Would it be possible to somehow define requirements for datatypes?
>
> I don't want to take on the burden of code I write saying "yes, I've
> sniffed+validated this and it is absolutely a genbank file". That's a
> lot of responsibility, especially if people have malformed genbank files
> and their tools fail as a result.
>
> I would like to do this with BioPython and turf the validation to
> another library that exists to parse genbank files, that will raise an
> exception if they're invalid.
>
>> def sniff(self, filename):
>>   # Accept the file if Biopython can parse it as GenBank.
>>   from Bio import SeqIO
>>   try:
>>     self.records = list(SeqIO.parse(filename, "genbank"))
>>     return True
>>   except Exception:
>>     self.records = None
>>     return False
>>
>> def validate(self, dataset):
>>   # Collect any parser error instead of letting it propagate.
>>   from Bio import SeqIO
>>   errors = list()
>>   try:
>>     self.records = list(SeqIO.parse(dataset.file_name, "genbank"))
>>   except Exception as e:
>>     errors.append(e)
>>   return errors
>>
>> def set_meta(self, dataset, **kwd):
>>   # Record the sequence count only when a parsed result is cached.
>>   if getattr(self, "records", None) is not None:
>>     dataset.metadata.number_of_sequences = len(self.records)
>
> so much easier! And I can shift the burden of validation and sniffing to
> upstream, rather than any failures being my fault and requiring
> maintenance of a complex sniffer.
>
> Cheers,
> Eric
>
> - --
> Eric Rasche
> Programmer II
> Center for Phage Technology
> Texas A&M University
> College Station, TX 77843
> 404-692-2048
> [email protected]
> [email protected]
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>   http://lists.bx.psu.edu/
>
> To search Galaxy mailing lists use the unified search at:
>   http://galaxyproject.org/search/mailinglists/