[Pharo-users] PhyloclassTalk (was: Re: [squeak-dev] [ANN] BioSqueak 0.4)

Hernán Morales Durand Sat, 23 Feb 2013 11:53:46 -0800

Hello Hannes,

Sorry for the late response, I have been working intensively on an application using BioSmalltalk. Here is a post with some screenshots: http://biosmalltalk.blogspot.com.ar/2013/02/phyloclasstalk-preview.html

As I've said, it is developed in Pharo but most subsystems work in Squeak too. I cross-post to the Pharo users list in case someone is interested.


El 16/02/2013 16:00, H. Hirzel escribió:

Hello Hernán

Thank you for your elaboration on the topic of BioSqueak.

On 2/1/13, Hernán Morales Durand <[email protected]> wrote:


Hello Hannes,
Thanks for the feedback! Some answers then between the lines:

El 01/02/2013 11:35, H. Hirzel escribió:

Hello Hernán

This is interesting.
http://biosmalltalk.blogspot.com/

I understand that you have constructed an internal domain specific
language (a DSL, a query language) for dealing with genetic data in
Smalltalk

search := BioNCBIWWWBlastClient new nucleotide query:
'CCCTCAAACAT...TTTGAGGAG';
     hitListSize: 150;
     filterLowComplexity;
     expectValue: 10;
     wordSize: 11;
     blastn;
     blastPlainService;
     alignmentViewFlatQueryAnchored;
     formatTypeXML;
     fetch.
search outputToFile: 'blast-query-result.xml' contents: search result.

Is there a description of this DSL?


It is not a DSL in the traditional sense, i.e., built using ANTLR, Lex or
Yacc, but an embedded "DSL", thus inheriting the syntax and execution
semantics of Smalltalk.


Yes, I understand, the regular thing in Smalltalk, as every Smalltalk
domain model could be considered a DSL to a certain extent.

Lukas Renggli has a useful classification of DSLs in his PhD dissertation on
    'Dynamic Language Embedding'
     http://scg.unibe.ch/archive/phd/renggli-phd.pdf
     Chapter 2

According to that you probably have an Internal DSL (chapter 2.1), right?

Yes, it would fit into the Internal DSL category. I didn't know about that classification, thanks for sharing.

To clarify: I've not built a DSL specification for the QBlast API,
although I'm willing to develop DSLs for bioinformatics APIs in a
Smalltalk language workbench (anyone?).

OK

Currently the messages for performing alignments at the NCBI are based
on the API specification,
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node9.html .

The unary sends are the result of a plan to reduce parametrization and to
replicate or customize Blast settings through a UI. This is because
geneticists experiment with changing Blast parameters over time and I want
my system not to be tied to textual parameters.
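
As a rough illustration of that idea (not the BioSmalltalk code, and written in Python rather than Smalltalk): when every setting is a parameterless message, a configuration can be stored as a plain list of names and replayed later, for example by a UI. The class, the method names and the parameter keys below are hypothetical; the keys are only modelled on the names used in the NCBI URL API.

```python
class BlastSettings:
    """Hypothetical settings holder; not the BioSmalltalk or NCBI API."""

    def __init__(self):
        self.params = {}

    # Each parameterless method records one fixed choice and returns self,
    # so calls can be chained or replayed by name.
    def blastn(self):
        self.params["PROGRAM"] = "blastn"  # key name is an assumption
        return self

    def filter_low_complexity(self):
        self.params["FILTER"] = "L"  # key and value are assumptions
        return self

    def format_type_xml(self):
        self.params["FORMAT_TYPE"] = "XML"  # key name is an assumption
        return self


def replay(step_names):
    """Rebuild settings from a stored list of method names."""
    settings = BlastSettings()
    for name in step_names:
        getattr(settings, name)()
    return settings


# A list like this could be persisted by a UI and applied again later.
saved = ["blastn", "filter_low_complexity", "format_type_xml"]
settings = replay(saved)
```

Because each step carries no argument, the stored list needs no parsing of textual parameter values; changing an experiment means adding or removing a name.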

  > The data is kept in XML files and
  > all is read into the image to be queried. It seems that you don't have
  > a problem with the image size?

Yes I had problems with image size and performance, a lot indeed.

Actually, working with an XML DOM for alignments of 5000 or more hits,
Squeak (and Pharo of course) started to show slowness. So I cannot
keep all XML nodes in memory. To overcome this problem I've tried the
SAX (push) parser and the XMLPullParser (which is a StAX parser). Then
my idea was to reduce the tree by specifying only the XML nodes which
I'm interested in. After reducing the nodes, I wrote custom XML tree
classes with a specific API to query blast XML results, taken from the
DTD specification. AFAIK this is known as an XML digester, which is
somewhat "evolved" in Java
(http://commons.apache.org/digester/xmlrules.html).
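
A minimal sketch of the node-reduction idea described above, using the Python standard library rather than the Squeak/Pharo parsers: the document is read as a stream, only the child nodes of interest are copied out of each hit, and the processed subtree is discarded. The function name `iter_hits` and the sample data are made up for illustration; the tags `Hit`, `Hit_id`, `Hit_def` and `Hit_len` follow the BLAST XML output format.

```python
import io
import xml.etree.ElementTree as ET


def iter_hits(source, wanted=("Hit_id", "Hit_def", "Hit_len")):
    """Yield one small dict per <Hit>, keeping only the wanted child nodes."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Hit":
            yield {tag: elem.findtext(tag) for tag in wanted}
            # Drop the hit's children; the emptied <Hit> element itself
            # stays attached to its parent.
            elem.clear()


# Made-up sample with two hits, standing in for a large result file.
SAMPLE = b"""<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit><Hit_num>1</Hit_num><Hit_id>gi|1</Hit_id><Hit_def>sample A</Hit_def><Hit_len>120</Hit_len></Hit>
<Hit><Hit_num>2</Hit_num><Hit_id>gi|2</Hit_id><Hit_def>sample B</Hit_def><Hit_len>95</Hit_len></Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

hits = list(iter_hits(io.BytesIO(SAMPLE)))
```

Values come back as strings; converting `Hit_len` to an integer, or descending into the HSP nodes, would be the job of the custom result classes.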


I understand that you took
       http://www.squeaksource.com/XMLSupport/
       (the XML support repo for Pharo, for Squeak XML support is in
the trunk image)
and modified it.

I have built a
dynamic query builder in Morphic for querying the XML, providing the
possibility to persist and update the filters. Unfortunately for Squeak
users I'm using the Polymorph API, which I think is not available in
Squeak.


A screen shot would be appreciated... :-)


Ok, the blog post includes some screenshots.

We worked using the XML push/pull parsers for reading genomes and they
worked acceptably. But it is impossible to keep nodes for 3 GBytes of
XML at least for now in Squeak/Pharo.


According to my experience keeping XML structures in the image is
inefficient in terms of memory usage. More efficient ways are needed
and XML is then only for reading/writing to external files.


Exactly, XML is not good at all for big data.

More critical problems arise when trying to work with microarray
data (big data) in Smalltalk, which is not document-oriented. I had to
switch to "solutions" like SQL, or HDF5 using PyTables with a
well-designed schema for our input. The advantages are that they support
indexing and reading data in blocks, besides tools like ViTables or
HDFView to navigate the data. Until someone provides some bits in this
field, there is little opportunity for using Smalltalk.
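
To make "indexing and reading data in blocks" concrete, here is a small sketch using SQLite from the Python standard library (the SQL route mentioned above, not PyTables). The table layout, the column names and the generated values are invented for the example and are not the schema used in our work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (probe_id TEXT, sample_id INTEGER, value REAL)"
)

# Invented data: 100 probes x 3 samples = 300 rows.
rows = [(f"probe{p}", s, p * 0.5 + s) for p in range(100) for s in range(3)]
conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)

# The index lets lookups by probe avoid a full table scan.
conn.execute("CREATE INDEX idx_probe ON expression (probe_id)")


def read_in_blocks(cursor, block_size=64):
    """Yield lists of at most block_size rows until the cursor is exhausted."""
    while True:
        block = cursor.fetchmany(block_size)
        if not block:
            break
        yield block


cursor = conn.execute(
    "SELECT probe_id, sample_id, value FROM expression ORDER BY rowid"
)
block_sizes = [len(block) for block in read_in_blocks(cursor)]

probe7 = conn.execute(
    "SELECT value FROM expression WHERE probe_id = ? ORDER BY sample_id",
    ("probe7",),
).fetchall()
```

Only one block of rows is held in memory at a time, which is the property that is hard to get when every XML node lives in the image.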


But what I understand is that people keep DNA data in memory for speed
reasons and use C++ or Perl programs to deal with it.

It really depends on the type of analysis. I've seen most starter bioinformaticians prefer Python over Perl because of the nicer syntax and more complete library support.

I don't know of big data projects using C++ with raw DNA data. Compression with indexing, and specialized file formats are used these days, splitting data in clusters where needed. I would love to see some Smalltalkers working on dataspaces too.


See these presentations: http://www.slideshare.net/mndoci/presentations

I would welcome a short writeup with a general introduction to what
you are doing in http://biosmalltalk.blogspot.com/.


We have submitted a paper recently and we are waiting for the review
results. Meanwhile, we are preparing another paper for a
phylogenetics decision support system which includes text-mining and a
rule engine. I will try to write an entry in the next week with
screenshots.


Any news on this?


No news so far, still in the reviewing process.
Best regards,

Hernán

Kind regards
Hannes

Best regards,

Hernán

Kind regards

Hannes Hirzel

On 2/1/13, Hernán Morales Durand <[email protected]> wrote:

Hi,

A few days ago I created a port of BioSmalltalk for Squeak too.
BioSmalltalk is a library for doing Bioinformatics with Smalltalk. This
port is labelled "BioSqueak" and I expect to release a version for
Windows sometime soon. You can find it in:

http://code.google.com/p/biosmalltalk/downloads/list

I'm very interested in feedback.
Thanks for reading.

Hernán

--
Hernán Morales
Institute of Veterinary Genetics (IGEVET)
http://igevet.fcv.unlp.edu.ar
National Scientific and Technical Research Council (CONICET).
La Plata (1900), Buenos Aires, Argentina.
Telephone: +54 (0221) 421-1799.
Internal: 422
Fax: 425-7980 or 421-1799.
