Re: [base] BASE 2 & Illumina arrays

Jeremy Davis-Turak Wed, 01 Aug 2007 12:40:26 -0700

On 8/1/07, Nicklas Nordborg <[EMAIL PROTECTED]> wrote:
> Jeremy Davis-Turak wrote:
> > Hi Nicklas,
> >
> > I uploaded some data files to the Ticket.
>
> Thanks a lot! That was exactly what we needed.
>
> >
> > Here is a brief summary of what the data looks like:
> >
> > 1)  Annotation data: CSV file.  It's too bad that it's a CSV, because
> > some of the fields contain commas!
>
> Hmmm... it looks hard to import this with existing importers. I'll have
> to start with the raw data import and leave this until later.
>


Yeah, what I did was just open it in Excel and save it as a .txt file.
Not ideal, but the easiest way so far.
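
A scripted alternative to the Excel round trip could look roughly like the sketch below. It assumes (this is not confirmed anywhere in the thread) that the annotation file puts quotes around the fields that contain commas, and that tab-separated text is an acceptable target. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def csv_to_tab(src, dst):
    """Copy rows from a CSV stream to a tab-separated stream.

    Assumes fields containing commas are quoted in the input.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    for row in reader:
        writer.writerow(row)


# Made-up sample rows, not taken from a real annotation file.
sample = io.StringIO('ProbeID,Symbol,Definition\n123,Abc1,"kinase, type 2"\n')
out = io.StringIO()
csv_to_tab(sample, out)
print(out.getvalue())
```

With real files, open both with `newline=""` as the `csv` module documentation recommends. If the commas turn out not to be quoted, this will not help and the columns will still be split incorrectly.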


> > 2)  Data: (header is on ~ line 8)
> > a) For each set of chips that are processed at the same time, there is
> > one resulting file.  Thus, if you did two rat chips (each of which has
> > 12 arrays on them), you would have 24 arrays contained in one file.
> > b) Depending on the settings of the software at the time of scanning,
> > you can have somewhere from 1-8 data columns per array (I don't know
> > the exact range, but I know that it's variable).
> > c)  The first column contains the probe IDs, the rest of them are data.
> > d) Each data column name is a concatenation of 3 things:
> >    i)   The data type (e.g. 'AVG_Signal' or 'BEAD_STDEV')
> >    ii)  The chip number (10 digits)
> >    iii) A capital letter indicating the position of the array on the
> >         chip (i.e. A-F for human, A-H for mouse, or A-L for rat).
> >    EXAMPLE: the first 8 columns in my rat file are:
>
> It should be rather easy to create the raw bioassays. Once we have found
> the column headers, we can extract the chip number and the capital
> letter and use them as the name for the raw bioassays. The remaining
> parts of the headers should be easy to map to raw data properties (since
> you have already done this in the raw-data-types.xml for us).
>
> Do we have to worry about messed up files? For example, if there are
> AVG_Signal and BEAD_STDEV columns for one data set but only AVG_Signal
> for another?

I haven't encountered any messed up files yet.  However, I think it
would be easy to catch them, since you would be parsing the headers
anyway.  I don't know if other people have files like that, which they
wish to import, but for me, I would want the plugin to throw an error
in that case.
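
The header parsing and the error check discussed above can be sketched as follows. The exact separator between the three parts of a column name is not given in this thread, so the pattern below ASSUMES an optional single `-`, `_`, `.` or space between them; the function names and the example headers are hypothetical, and this is not the plug-in's actual code.

```python
import re
from collections import defaultdict

# ASSUMED layout: <data type>[sep]<10-digit chip number>[sep]<position A-L>
HEADER_RE = re.compile(
    r"^(?P<dtype>[A-Za-z_]+?)[-_. ]?(?P<chip>\d{10})[-_. ]?(?P<pos>[A-L])$"
)


def parse_header(name):
    """Split one data column name into (data type, chip number, position)."""
    match = HEADER_RE.match(name.strip())
    if match is None:
        raise ValueError(f"Unrecognised data column header: {name!r}")
    return match.group("dtype"), match.group("chip"), match.group("pos")


def group_columns(headers):
    """Map each array name ('<chip>_<position>') to its list of data types.

    Raises ValueError if the arrays do not all have the same data types.
    """
    arrays = defaultdict(list)
    for name in headers[1:]:  # the first column holds the probe IDs
        dtype, chip, pos = parse_header(name)
        arrays[f"{chip}_{pos}"].append(dtype)

    expected = None
    for array, dtypes in arrays.items():
        if expected is None:
            expected = sorted(dtypes)
        elif sorted(dtypes) != expected:
            raise ValueError(
                f"Array {array} has columns {sorted(dtypes)}, expected {expected}"
            )
    return dict(arrays)


# Made-up headers for illustration only.
good = [
    "ID",
    "AVG_Signal-1234567890_A", "BEAD_STDEV-1234567890_A",
    "AVG_Signal-1234567890_B", "BEAD_STDEV-1234567890_B",
]
print(group_columns(good))
```

Dropping one of the BEAD_STDEV columns from the example list makes `group_columns` raise a `ValueError`, which is the behaviour asked for above.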

>
> We could simply stop there and let the users revert to manual work if
> they need to connect the imported raw bioassays with scans, array
> designs and experiments, but I think we can do a little bit more. I just
> have a few questions.
>
> Should all raw bioassays be associated with a single scan (and thus the
> same hybridization) or do we need to associate the raw bioassays from
> each chip with a separate scan and hybridization?
>

I'll get back to you later to confirm this, but I believe I made one
hybridization and one scan.  For us, this made most sense because it
models what actually goes on.  I don't know what other groups prefer,
or if they require any functionality that is lost by having only one
hyb.

> It is difficult to associate the raw bioassays with array designs, since
> there are no spot coordinates in the file. We could fake this and use
> block=1, column=1 and row=row number in file. The benefit is that
> analysis will behave better if all raw bioassays are associated with the
> same array design. The drawback is that we must also fake the array
> design in the same way. It should be possible to use the existing
> ReporterMapImporter for this if we feed it the same raw data file.
>

We don't actually use array designs at this point, so I'm not sure how
to address this.  Faking it sounds fine to me.

However, as a side note, the reason there is no spot info is that for
Illumina, each array on each chip is different!  The scanning
software reads in a set of files which contain the array designs, and
spits out the "gene_profile.csv" file, which is actually the data
AVERAGED over all the beads for each probe.  So, if someone REALLY
wanted to get into the deeper level of analysis (bead-level), they
would have to upload some additional files (which I've never dealt
with).  Thus, I recommend not dealing with that layer just yet.
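
The placeholder coordinates Nicklas describes (block=1, column=1, row=row number in file) could be generated along these lines. This is only a sketch: it ASSUMES that "row number" means the position of the probe among the data rows, counted from 1, and the function name is hypothetical.

```python
def fake_coordinates(probe_ids):
    """Yield (probe_id, block, column, row) with placeholder coordinates.

    Block and column are always 1; row is the 1-based position of the
    probe among the data rows (an assumption, see the note above).
    """
    for row_number, probe_id in enumerate(probe_ids, start=1):
        yield probe_id, 1, 1, row_number


# Made-up probe IDs for illustration only.
coords = list(fake_coordinates(["probe_a", "probe_b", "probe_c"]))
print(coords)
```

The same numbering would have to be used for both the faked array design and the raw data import, otherwise the spots will not line up.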

> I am also thinking of the possibility of using the plug-in from the
> Experiment view page if the experiment is of the 'illumina' data type.
> Then, the raw bioassays created by the import could be assigned to the
> experiment by the plug-in, saving yet another manual step.
>

That seems cool.  Would it then be easy to extend this feature to all
data types?

> >
> > Thanks for making this plugin!
>
> Well... it is not implemented yet...
>
> >> The files will be put in a protected repository that is only
> >> available to the core developers.
>
> Since you uploaded the files to our Trac I assume that you are not
> worried about other users seeing them. Is it ok to use some of the files
> in our regular test programs? They will not be included in the binary
> distribution, only in the source distribution and of course from direct
> subversion access.

Yes, you can use those data in your test files.

>
> /Nicklas
>
>

Jeremy

_______________________________________________
The BASE general discussion mailing list
basedb-users@lists.sourceforge.net
unsubscribe: send a mail with subject "unsubscribe" to
[EMAIL PROTECTED]