Re: [base] BASE 2 & Illumina arrays

Nicklas Nordborg Wed, 01 Aug 2007 23:19:42 -0700

Jeremy Davis-Turak wrote:
>>> 1)  Annotation data: CSV file.  It's too bad that it's a CSV, because
>>> some of the fields contain commas!
>> Hmmm... it looks hard to import this with existing importers. I'll have
>> to start with the raw data import and leave this until later.
>>
> 
> Yeah, what I did was just open it in Excel and save it as a .txt file.
>  Not ideal, but the easiest way so far.


I searched the net and found a regexp that can split this kind of data 
properly: ,(?=(?:[^"]*"[^"]*")*(?![^"]*"))

I have not tested it with the annotation file yet, but if it works I'll 
add it as a preset in the "Test with file" function.

(Thanks to Raimond Brookman
http://blogs.infosupport.com/raimondb/archive/2005/04/27/199.aspx)
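
A minimal sketch of how the regexp could be tried out before it goes in as a preset, written in Python rather than in BASE itself. The sample line and the helper name split_csv_line are made up for illustration and do not come from the real annotation file.

```python
import re

# The regexp quoted above: a comma only counts as a separator when it is
# followed by an even number of double quotes, i.e. when it sits outside
# a quoted field.
CSV_SPLIT = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')


def split_csv_line(line):
    """Split one CSV line, leaving commas inside quoted fields alone."""
    return CSV_SPLIT.split(line)


# Hypothetical sample line with a comma inside a quoted field.
sample = 'GI_123,"kinase, putative",chr1'
fields = split_csv_line(sample)
print(fields)
```

Note that the surrounding quotes stay on the field, so a real importer would probably strip them in a second step.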

>> Do we have to worry about messed up files? For example, if there is
>> AVG_Signal and BEAD_STDEV columns for one data set but only AVG_Signal
>> for another?
> 
> I haven't encountered any messed up files yet.  However, I think it
> would be easy to catch them, since you would be parsing the headers
> anyway.  I don't know if other people have files like that, which they
> wish to import, but for me, I would want the plugin to throw an error
> in that case.

I'll include a check for this.
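
A rough sketch of what such a check might look like, in Python rather than as the actual plug-in code. It assumes, purely for illustration, that data columns are named "<measure>-<sample>" (for example AVG_Signal-s1) and that headers without the separator are annotation columns. The real header layout may differ, and check_headers and the sample names are hypothetical.

```python
from collections import defaultdict


def check_headers(headers, sep="-"):
    """Raise ValueError unless every data set has the same set of columns.

    Assumption: data headers look like '<measure><sep><sample>'.
    """
    columns = defaultdict(set)
    for header in headers:
        if sep not in header:
            continue  # assumed to be an annotation column, not sample data
        measure, sample = header.split(sep, 1)
        columns[sample].add(measure)
    if not columns:
        return {}
    # Use the first data set as the reference and compare the rest to it.
    expected = next(iter(columns.values()))
    for sample, measures in columns.items():
        if measures != expected:
            raise ValueError(
                "Data set %r has columns %s, expected %s"
                % (sample, sorted(measures), sorted(expected))
            )
    return dict(columns)


# Hypothetical header row in which every data set has both columns.
ok = check_headers(
    ["ID", "AVG_Signal-s1", "BEAD_STDEV-s1", "AVG_Signal-s2", "BEAD_STDEV-s2"]
)
print(sorted(ok))
```

A header row where s2 lacks BEAD_STDEV would make check_headers raise a ValueError, which matches the "throw an error" behaviour asked for above.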


>> We could simply stop there and let the users revert to manual work if
>> they need to connect the imported raw bioassays with scans, array
>> designs and experiments, but I think we can do a little bit more. I just
>> have a few questions.
>>
>> Should all raw bioassays be associated with a single scan (and thus the
>> same hybridization) or do we need to associate the raw bioassays from
>> each chip with a separate scan and hybridization?
>>
> 
> I'll get back to you later to confirm this, but I believe I made one
> hybridization and one scan.  For us, this made most sense because it
> models what actually goes on.  I don't know what other groups prefer,
> or if they require any functionality that is lost by having only one
> hyb.

I'll go for the simple solution in the first version, which is to let 
the user select one scan, one protocol and one software, which are then 
associated with all raw bioassays.


>> It is difficult to associate the raw bioassays with array designs, since
>> there are no spot coordinates in the file. We could fake this and use
>> block=1, column=1 and row=row number in file. The benefit is that
>> analysis will behave better if all raw bioassays are associated with the
>> same array design. The drawback is that we must also fake the array
>> design in the same way. It should be possible to use the existing
>> ReporterMapImporter for this if we feed it the same raw data file.
>>
> 
> We don't actually use array designs at this point, so I'm not sure how
> to address this.  Faking it sounds fine to me.

I'll skip this part for now. If there is time left over when the rest of 
the functionality is implemented I might give it another shot.
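
For reference, the faking scheme from the quoted text (block=1, column=1, row=row number in file) could be sketched like this in Python. It is not BASE code: fake_coordinates and the reporter ids are hypothetical, and numbering rows from 1 is an assumption.

```python
def fake_coordinates(reporter_ids):
    """Assign block=1, column=1 and row=position in file to each reporter.

    Assumption: rows are numbered from 1 in file order.
    """
    return [
        {"reporter": rid, "block": 1, "column": 1, "row": row}
        for row, rid in enumerate(reporter_ids, start=1)
    ]


# Hypothetical reporter ids in the order they appear in the data file.
features = fake_coordinates(["GI_1", "GI_2", "GI_3"])
for feature in features:
    print(feature)
```

The same mapping would have to be used for both the array design and the raw data import, otherwise the two would not line up.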

> However, as a side note, the reason there is no spot info is that for
> Illumina, each array on  each chip is different!  The scanning
> software reads in a set of files which contain the array designs, and
> spits out the "gene_profile.csv" file, which is actually the data
> AVERAGED over all the beads for each probe.  So, if someone REALLY
> wanted to get into the deeper level of analysis (bead-level), they
> would have to upload some additional files (which I've never dealt
> with).  Thus, I recommend not dealing with that layer just yet.

Ok, this sounds almost like the same setup as for Affymetrix files. For 
Affymetrix we have made a special solution that keeps the data in the 
original files instead of in the database. In BASE 2.5 we hope to create 
a generic solution for this.

> 
>> I am also thinking of the possibility of using the plug-in from the
>> Experiment view page if the experiment is of the 'illumina' data type.
>> Then, the raw bioassays created by the import could be assigned to the
>> experiment by the plug-in, saving yet another manual step.
>>
> 
> That seems cool.  Would it be then easy to extend this feature to all
> data types?

It would not be as useful for the other data types, since only one data 
set can be imported at a time.

>> Since you uploaded the files to our Trac I assume that you are not
>> worried about other users seeing them. Is it ok to use some of the files
>> in our regular test programs? They will not be included in the binary
>> distribution, only in the source distribution and of course from direct
>> subversion access.
> 
> Yes, you can use those data in your test files.

Thanks.

/Nicklas

