James,

In addition to Ross's comments, consider that the base class of all Galaxy datasets (library datasets, etc.) is the DatasetInstance class, whose get_file_name() method leverages the file_name property from the Dataset class (see ~/lib/galaxy/model/__init__.py). This file_name property is the pointer to the disk file for all Galaxy datasets that are opened for reading. At first glance, I don't see a lot of calls to open these files in the Galaxy library framework, but it may still be problematic to handle even a few. I believe most of these calls are in the Galaxy job and job-related components (metadata setting, etc.).
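One way to picture the change this would require: if a DB-backed dataset could materialize itself to disk the first time get_file_name() is called, everything downstream that expects a path would keep working. A minimal standalone sketch — these are not Galaxy's actual classes, the query is invented, and sqlite3 merely stands in for your database:

```python
import os
import sqlite3
import tempfile

class DbBackedDataset:
    """Hypothetical dataset whose raw data lives in a database, mimicking
    the role of Dataset.file_name / DatasetInstance.get_file_name()."""

    def __init__(self, db_path, query):
        self.db_path = db_path
        self.query = query
        self._file_name = None

    def get_file_name(self):
        # Lazily write the query result to a temp file and cache the path,
        # so every downstream consumer that expects a disk file still works.
        if self._file_name is None:
            fd, path = tempfile.mkstemp(suffix=".txt")
            conn = sqlite3.connect(self.db_path)
            try:
                with os.fdopen(fd, "w") as out:
                    for row in conn.execute(self.query):
                        out.write("\t".join(str(v) for v in row) + "\n")
            finally:
                conn.close()
            self._file_name = path
        return self._file_name
```

The catch, as noted above, is that jobs and metadata-setting code open these files in several places, so the materialization would have to happen before any of those paths are handed out.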
As Ross suggests, it may be a better approach to consider using the Galaxy API to translate your db queries into actual Galaxy data library dataset files, if that would work for you.

On Nov 8, 2011, at 10:16 PM, Ross wrote:

> Hi James,
>
> Existing tools mostly take file paths. This has arguably useful side
> effects in isolating cluster node execution from the galaxy server
> process and persisting the computation input on a file-system -
> arguably more stable long term than queries on dynamic database tables
> - and maybe not somewhere you want or need to invest a lot of effort.
>
> Depends what your users need to make use of existing Galaxy tools - eg
> a specific experiment's sequence data in (eg) fastq format? Take a
> look under scripts/api - maybe you could write scripted queries into
> galaxy libraries automagically. Alternatively, if you want users to
> script their own extractions, it's not hard to write a new Galaxy tool
> that writes a new (eg) fastq file based on a database query with
> parameters supplied by the user. That fastq file appears in a history,
> where it is then available for all existing Galaxy fastq tools and is
> a shareable persistent object for replicable analysis.
>
> On Tue, Nov 8, 2011 at 9:39 PM, James Ireland <jirel...@5amsolutions.com> wrote:
>> Hi Greg,
>>
>> So, here are my concerns:
>>
>> 1. From looking through some of the source it *appears* to me that the
>> raw data input calls are spread across the various libraries as standard
>> file IO calls. So, if I wanted to use my db underneath I'd need to
>> replace/catch all of these. I was hoping that there would be fewer
>> points of customization required.
>>
>> 2. Even solving (1), when tools like the SAM tools, ClustalW, etc. are
>> called from Galaxy, I am assuming that behind the scenes these apps are
>> being passed file paths. I know that's how I've wrapped my own tools in
>> Galaxy. So, I would need to instantiate my data to file at that point.
>> That would mean adding some more special sauce to catch whenever a file
>> path is being passed out to a tool and make sure the file gets created
>> first.
>>
>> My HUGE caveat is that I still haven't spent much time with the source,
>> so I could be way off on these concerns - but this is my impression.
>> I'd welcome enlightenment if I'm wrong!
>>
>> Thanks,
>> -J
>>
>> On Tue, Nov 8, 2011 at 3:13 PM, Greg Von Kuster <g...@bx.psu.edu> wrote:
>>>
>>> Hi James,
>>>
>>> I haven't gone too far down the implementation path in this area, so
>>> I'm certainly not aware of the issues you may be discovering. The key
>>> would be to implement a layer on top of your database so that Galaxy's
>>> data library upload component can treat the data contained in your
>>> database just like it treats the content of a file on the file system.
>>> Since Galaxy must open and read data files stored on the file system
>>> in order to use them as input to Galaxy tools, it should be able to do
>>> the same for data made available from a database table (I would
>>> assume, but again, I'm not completely sure of the potential issues).
>>> The data files resulting from the execution of these Galaxy tools
>>> would, of course, be files on the file system within Galaxy's default
>>> file store.
>>>
>>> By "external tools" do you mean tools that are not a part of the
>>> Galaxy instance?
>>>
>>> On Nov 8, 2011, at 5:14 PM, James Ireland wrote:
>>>
>>> Hi Greg,
>>>
>>> Did more digging around today in the Galaxy source, and maybe I
>>> misjudged the situation. Although getting a representation of my
>>> datasets into Galaxy appears relatively straightforward, at the end of
>>> the day, reads of raw data and passing data to and from external
>>> tools, etc. all assume the data is sitting in a file, correct?
>>> Thanks again,
>>> -J
>>>
>>> On Mon, Nov 7, 2011 at 6:29 PM, Greg Von Kuster <g...@bx.psu.edu> wrote:
>>>>
>>>> Hi James,
>>>>
>>>> Since genomic data files are often very large, Galaxy does not store
>>>> them in a database, so this specific scenario has not been
>>>> implemented as far as I know. However, you may be able to implement
>>>> what you've described without too much difficulty. If you could
>>>> implement a layer on top of your database that leverages Galaxy's
>>>> features for uploading a directory of files or file system paths
>>>> (maybe better in this case) without copying the data into Galaxy's
>>>> default file store, it should be fairly trivial to make Galaxy work
>>>> with it. Using this combination, Galaxy will read the data (without
>>>> making any changes to it) in order to generate metadata associated
>>>> with the data. The metadata is stored separately from the raw data.
>>>>
>>>> I was at the Pac Bio meeting, so we definitely met there. Good to
>>>> hear from you!
>>>>
>>>> On Nov 7, 2011, at 8:58 PM, James Ireland wrote:
>>>>
>>>> Hi Greg,
>>>>
>>>> Thanks for the fast response! I think we might have met last year at
>>>> the PacBio 3rd party software vendor meeting.
>>>>
>>>> So, I had seen the documents for the data repository, and the
>>>> "Uploading a Directory of Files" option with "Copy data into Galaxy?"
>>>> de-selected seems the closest analog to what I want to do. In my
>>>> complete and utterly naive understanding of how Galaxy works, if I
>>>> could wrap my data repository (in this case, my db) with the same
>>>> sort of functionality as a file directory (scan, load, etc.), then I
>>>> would guess that the integration wouldn't be that painful. Obviously,
>>>> this would require custom development. This is important enough to my
>>>> company that we'd be willing to work on doing it - but I'm guessing
>>>> I'm way off base?
>>>> This seems like it would be a fairly common request - to your
>>>> knowledge, has anyone outside Galaxy rolled their own solution along
>>>> these lines?
>>>>
>>>> Thanks again,
>>>> -J
>>>>
>>>> On Mon, Nov 7, 2011 at 11:41 AM, Greg Von Kuster <g...@bx.psu.edu> wrote:
>>>>>
>>>>> Hello James,
>>>>>
>>>>> This is not currently possible - the options for uploading files to
>>>>> Galaxy data libraries are documented in our wiki at
>>>>> http://wiki.g2.bx.psu.edu/Admin/Data%20Libraries/Uploading%20Library%20Files
>>>>>
>>>>> On Nov 7, 2011, at 2:11 PM, James Ireland wrote:
>>>>>
>>>>> Greetings!
>>>>>
>>>>> I would like to expose data I have in a relational database as a
>>>>> data library in Galaxy. I would really like to do this without
>>>>> Galaxy having to make a local copy of the data to the file system.
>>>>> Is this possible, and could you point me to any code examples and/or
>>>>> documentation?
>>>>>
>>>>> I'm sure this must be covered somewhere in the documentation or
>>>>> mailing list, but I haven't been able to find it.
>>>>>
>>>>> Thanks for your help!
>>>>>
>>>>> -James
>>>>> --
>>>>> J Ireland
>>>>> www.5amsolutions.com | Software for Life(TM)
>>>>> m: 415 484-DATA (3282)
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client. To manage your subscriptions to this
> and other Galaxy lists, please use the interface at:
>
> http://lists.bx.psu.edu/

Greg Von Kuster
Galaxy Development Team
g...@bx.psu.edu
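Ross's suggestion earlier in the thread — a new Galaxy tool that writes a fastq file from a parameterized database query — could be sketched roughly as follows. The table and column names (`reads`, `experiment_id`, etc.) are invented, sqlite3 stands in for whatever database is actually in use, and a real tool would also need the usual tool XML wrapper to expose the parameters:

```python
import sqlite3
import sys

def export_fastq(db_path, experiment_id, out_path):
    """Write one experiment's reads out of the database as FASTQ.
    Schema is hypothetical: reads(name, seq, qual, experiment_id)."""
    conn = sqlite3.connect(db_path)
    try:
        with open(out_path, "w") as out:
            rows = conn.execute(
                "SELECT name, seq, qual FROM reads WHERE experiment_id = ?",
                (experiment_id,),
            )
            for name, seq, qual in rows:
                # Standard 4-line FASTQ record.
                out.write("@%s\n%s\n+\n%s\n" % (name, seq, qual))
    finally:
        conn.close()

if __name__ == "__main__" and len(sys.argv) == 4:
    # In a real tool, Galaxy substitutes these from the tool's XML params.
    export_fastq(sys.argv[1], sys.argv[2], sys.argv[3])
```

As Ross notes, the payoff is that the resulting fastq lands in a history as an ordinary, shareable dataset that all existing fastq tools can consume.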
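Greg's suggested "layer on top of your database" could start as something as simple as an export step that writes each logical dataset out as a file, then points Galaxy's library upload from file system paths (with "Copy data into Galaxy?" de-selected) at the resulting directory. A sketch under an invented schema — a `datasets` table with `name` and `content` columns, with sqlite3 again standing in for the real database:

```python
import os
import sqlite3

def export_datasets(db_path, export_dir):
    """Materialize every row of the hypothetical datasets table as a file
    under export_dir; returns the list of paths written. Galaxy would then
    link these paths into a data library without copying them."""
    os.makedirs(export_dir, exist_ok=True)
    conn = sqlite3.connect(db_path)
    paths = []
    try:
        for name, content in conn.execute("SELECT name, content FROM datasets"):
            path = os.path.join(export_dir, name)
            with open(path, "w") as out:
                out.write(content)
            paths.append(path)
    finally:
        conn.close()
    return paths
```

Galaxy would still read (not modify) each linked file once to set metadata, so the export has to happen before the library upload is triggered.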