On Tue, Feb 19, 2013 at 11:32 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> Hello all,
>
> Although they are these days also offering XML for many tools,
> the NCBI still make heavy use of the older ASN.1 file format
> (both as plain text and binary). This crops up in BLAST (e.g.
> as the BLAST archive format, or as dustmasker output), in
> the Entrez Utilities (e.g. for sequence data as an alternative
> to GenBank or FASTA format etc, or pubmed, etc) and also
> for 3D structures.
>
> I think it could make sense to define generic 'asn1' and
> 'asn1-binary' formats in the Galaxy core (name suggestions
> welcome), and even perhaps 'ncbi-asn1' and 'ncbi-asn1-binary'
> too. Then ToolShed entries can define domain specific
> subclasses. For instance, the BLAST+ wrapper could include
> definitions for the dustmasker output, and perhaps the BLAST
> archive format too. Separately anyone working with 3D
> structures as ASN.1 could define another sub-format, etc.
>
> I see this as a clear analogy to the assorted XML file formats
> in existence - defined in Galaxy as subclasses of the core
> XML format included with the Galaxy core.
>
> Would a pull request implementing this be acceptable?
>
> Peter
>
> P.S. Does anyone know an authoritative source for the MIME
> types used by the NCBI? Using the BLAST website they
> offer plain text ASN.1 just as text/plain, likewise efetch also
> seems to use text/plain for ASN.1 downloads. However I've
> seen references to chemical/ncbi-asn1-ascii and
> chemical/ncbi-asn1-binary mime-types mentioned, e.g.
> http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn
>
> i.e. It appears that 3D structure NCBI ASN.1 files use
> a well defined MIME type, while most NCBI ASN.1 text
> files default to text/plain - which we can handle nicely in
> Galaxy as subclasses.
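
A rough sketch of the subclassing idea proposed above, in plain Python.
These are stand-in classes only, not Galaxy's actual datatype API: the
class names, the 'blast-archive-asn1' extension and the
application/octet-stream MIME type are all assumptions made for
illustration.

```python
class Asn1Text:
    """Generic plain-text ASN.1 (hypothetical stand-in for a core datatype)."""
    file_ext = "asn1"
    mime_type = "text/plain"  # what the BLAST website and efetch appear to send
    is_binary = False


class Asn1Binary:
    """Generic binary ASN.1 (hypothetical stand-in for a core datatype)."""
    file_ext = "asn1-binary"
    mime_type = "application/octet-stream"  # assumed default, not from the thread
    is_binary = True


class BlastArchiveAsn1(Asn1Text):
    """Hypothetical domain-specific subclass a BLAST+ wrapper might define."""
    file_ext = "blast-archive-asn1"  # illustrative name only


def accepts(tool_input_type, dataset_type):
    """A tool asking for a generic format also accepts its sub-formats."""
    return issubclass(dataset_type, tool_input_type)
```

With this layout a tool declaring generic 'asn1' input would also take
the BLAST archive sub-format, in the same way XML subclasses are
handled, while a tool asking for 'asn1-binary' would not.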

I had an interesting email about the NCBI ASN.1 files from
Christopher Hogue, which (with his blessing) I am forwarding
to the list in case anyone else is interested - see below.

Thanks Christopher,

Peter

---------- Forwarded message ----------
From: Christopher Hogue
Date: Wed, May 1, 2013 at 4:06 PM
Subject: ASN.1 (text and binary) formats  - Use.
To: p.j.a.c...@googlemail.com


I can offer a bit of insight here about NCBI Asn.1 as it is confusing.

(Sorry about the time delay here - I'm not a Galaxy developer - I just
picked up on this thread via Google today.)

I was involved with the origin of chemical/ncbi-asn1-binary back when
I wrote Cn3D 1.0 in 1996 - see http://www.ch.ic.ac.uk/chemime/, which
was managed by H. Rzepa.

In case this is tl;dr - you have done a sensible thing - I don't
suggest you change anything you have implemented.

This is just to let you know what is involved in reading NCBI emitted
Asn.1 types - if that is what you want to do eventually.

The Asn.1 specification here:
http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn

refers to the current NCBI Mime-types:
(chemical/ncbi-asn1-ascii and chemical/ncbi-asn1-binary).

Yes, originally this was set up for 3D structures.  NCBI used *.val
for binary and *.prt for ascii forms, and these were wrapped with the
obsolete mime-types chemical/x-ncbi-asn1-binary and
chemical/x-ncbi-asn1-ascii respectively. The only exported data at the
time was MMDB/Cn3D structure data.

Newer versions of Cn3D started taking in sequences and sequence
alignments generated by VAST structure superposition. Now these
original file extensions could hold any exported Asn.1 symbol in their
spec set, so the *.val/*.prt file could hold anything from pubmed
types to 3D structure to sequences to a blast output, or various
nested fragments.

But unlike XML, Asn.1 data does not point back to its own schema, and
the binary files are trouble if you lose track of what top-level
symbol type they start at. There is no observable metadata. There is
no automated way to pull down the specification from a URI. This is
difficult to deal with without changing the NCBI Asn.1 1990
implementation itself.
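
For the text form at least, a reader can often recover the top-level
symbol, because NCBI text output normally starts with the type name
followed by "::=" (for example "Seq-entry ::= set {"). The sketch below
relies on that assumption; the function name is hypothetical, and
nothing comparable is possible for the binary form without already
knowing the type.

```python
import re

# Matches a leading "Type-name ::=" as seen at the start of text ASN.1 values.
_TOP_LEVEL = re.compile(r"\s*([A-Za-z][A-Za-z0-9-]*)\s*::=")


def top_level_type(text_head):
    """Return the top-level type name from the start of a text ASN.1 file.

    Returns None when the text does not begin with "Type-name ::=".
    """
    match = _TOP_LEVEL.match(text_head)
    return match.group(1) if match else None
```

A format sniffer could call this on the first line or so of a file and
then pick a sub-format by name, falling back to the generic text type
when the name is not recognised.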

So to fix Cn3D so it could import more types, they made an Asn.1
wrapper which is in the NCBI toolkit as "ncbimime.asn". This was
intended to be a catch-all wrapper for all the data types they emit
for Cn3D. The idea was that all their emitted spec objects would be
triaged the same way, by parsing the top-level part of this piece of
Asn.1, then you could figure out what the data was inside.

In practice this is still used by Cn3D, and as far as I know Cn3D is
the only external NCBI application still supporting this. They also
introduced *.cn3 (removing *.val and *.prt) to specify a stream
(either binary or ascii) that has this top-level Mime wrapper. In one
case they emit the wrong extension *.c3d on the VAST structure
similarity server.
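
The extension and MIME-type conventions described so far can be
summarised as a small lookup. This is only a sketch of my reading of
the above: the function and key names are hypothetical, and treating
*.val and *.prt as never carrying the wrapper is an assumption.

```python
import os

# Legacy extensions and the obsolete "x-" MIME types they were served with.
LEGACY = {
    ".val": ("binary", "chemical/x-ncbi-asn1-binary"),
    ".prt": ("ascii", "chemical/x-ncbi-asn1-ascii"),
}

# Current MIME types named in the NCBI specification.
CURRENT_MIME = {
    "binary": "chemical/ncbi-asn1-binary",
    "ascii": "chemical/ncbi-asn1-ascii",
}


def describe(filename):
    """Summarise what a file extension suggests; None if unrecognised."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in LEGACY:
        encoding, obsolete = LEGACY[ext]
        return {
            "encoding": encoding,
            "obsolete_mime": obsolete,
            "current_mime": CURRENT_MIME[encoding],
            "mime_wrapper": False,  # assumption: bare top-level object
        }
    if ext == ".cn3":
        # May be binary or ascii, so the content has to be inspected.
        return {
            "encoding": None,
            "obsolete_mime": None,
            "current_mime": None,
            "mime_wrapper": True,
        }
    return None
```

Note that the extension alone never says which top-level symbol a
*.val or *.prt file holds, which is the underlying problem.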

The problem with the mime-wrapper they "spec'ed" in is that it only
wraps their types, not any arbitrary Asn.1 object that might be made
with NCBI tools.

If you want to know more about Asn.1 - the Larmouth book is free to
download. 
http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html#larmouth

If you have any other questions about reading/writing Asn.1 specs -
write me - chogue {at} blueprint.org and I can probably answer most of
them, as I still use Asn.1 for my 3D structure research.

Also - the NCBI C++ toolkit datatool apparently has some support now
for converting Asn.1 into JSON. I haven't tested the extent of it, but
it looks interesting for simple types.

Cheers,
Christopher Hogue
www.blueprint.org


_____________________________________
Sent from http://dev.list.galaxyproject.org
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/
