On 8/1/07, Nicklas Nordborg <[EMAIL PROTECTED]> wrote:
> Jeremy Davis-Turak wrote:
> > Hi Nicklas,
> >
> > I uploaded some data files to the Ticket.
>
> Thanks a lot! That was exactly what we needed.
>
> >
> > Here is a brief summary of what the data looks like:
> >
> > 1) Annotation data: CSV file. It's too bad that it's a CSV, because
> > some of the fields contain commas!
>
> Hmmm... it looks hard to import this with existing importers. I'll have
> to start with the raw data import and leave this until later.
>
Yeah, what I did was just open it in Excel and save it as a .txt file.
Not ideal, but the easiest way so far.

> > 2) Data: (header is on ~ line 8)
> > a) For each set of chips that are processed at the same time, there is
> > one resulting file. Thus, if you did two rat chips (each of which has
> > 12 arrays on it), you would have 24 arrays contained in one file.
> > b) Depending on the settings of the software at the time of scanning,
> > you can have somewhere from 1-8 data columns per array (I don't know
> > the exact range, but I know that it's variable).
> > c) The first column contains the probe IDs, the rest of them are data.
> > d) Each data column name is a concatenation of 3 things:
> > i) The data type (e.g. 'AVG_Signal' or 'BEAD_STDEV')
> > ii) The chip number (10 digits)
> > iii) A capital letter indicating the position of the array on the
> > chip (i.e. A-F for human, A-H for mouse, or A-L for rat.)
> > EXAMPLE: the first 8 columns in my rat file are:
>
> It should be rather easy to create the raw bioassays. Once we have found
> the column headers, we can extract the chip number and the capital
> letter and use them as the name for the raw bioassays. The remaining
> parts of the headers should be easy to map to raw data properties (since
> you have already done this in the raw-data-types.xml for us).
>
> Do we have to worry about messed up files? For example, if there are
> AVG_Signal and BEAD_STDEV columns for one data set but only AVG_Signal
> for another?

I haven't encountered any messed up files yet. However, I think it would
be easy to catch them, since you would be parsing the headers anyway. I
don't know if other people have files like that which they wish to
import, but for me, I would want the plugin to throw an error in that
case.

>
> We could simply stop there and let the users revert to manual work if
> they needed to connect the imported raw bioassays with scans, array
> designs and experiments, but I think we can do a little bit more.
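The header handling discussed above (split each data column name into data type, chip number and array letter, name the raw bioassay from the last two, and throw an error on inconsistent column sets) could be sketched roughly as follows. This is only an illustration in Python, not the BASE plug-in itself. The function names `parse_column` and `group_arrays`, the `chip_letter` naming scheme, and the optional separator characters between the three parts are all assumptions, since the thread does not show an actual header.

```python
import re

# ASSUMPTION: the three parts may be joined directly or with '-', '_', '.'
# or a space; the thread only says the name is a "concatenation".
COLUMN_RE = re.compile(
    r"(?P<dtype>.+?)[-_. ]*(?P<chip>\d{10})[-_. ]*(?P<pos>[A-L])"
)


def parse_column(name):
    """Split one data column header into (data type, chip number, letter)."""
    match = COLUMN_RE.fullmatch(name.strip())
    if match is None:
        raise ValueError(f"Unrecognised data column header: {name!r}")
    return match.group("dtype"), match.group("chip"), match.group("pos")


def group_arrays(headers):
    """Map a hypothetical raw bioassay name to its list of data types.

    The first header is the probe ID column and is skipped. A ValueError
    is raised if the arrays do not all carry the same set of data types.
    """
    arrays = {}
    for name in headers[1:]:
        dtype, chip, pos = parse_column(name)
        # Hypothetical naming scheme: chip number, underscore, array letter.
        arrays.setdefault(f"{chip}_{pos}", []).append(dtype)

    expected = None
    for array_name, dtypes in arrays.items():
        if expected is None:
            expected = set(dtypes)
        elif set(dtypes) != expected:
            raise ValueError(
                f"Array {array_name} has columns {sorted(dtypes)}, "
                f"expected {sorted(expected)}"
            )
    return arrays
```

Calling `group_arrays` on the header row would give one entry per array, which is the point where a file with AVG_Signal and BEAD_STDEV for one array but only AVG_Signal for another gets rejected.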
> I just have a few questions.
>
> Should all raw bioassays be associated with a single scan (and thus the
> same hybridization) or do we need to associate the raw bioassays from
> each chip with a separate scan and hybridization?
>

I'll get back to you later to confirm this, but I believe I made one
hybridization and one scan. For us, this made most sense because it
models what actually goes on. I don't know what other groups prefer, or
if they require any functionality that is lost by having only one hyb.

> It is difficult to associate the raw bioassays with array designs, since
> there are no spot coordinates in the file. We could fake this and use
> block=1, column=1 and row=row number in file. The benefit is that
> analysis will behave better if all raw bioassays are associated with the
> same array design. The drawback is that we must also fake the array
> design in the same way. It should be possible to use the existing
> ReporterMapImporter for this if we feed it the same raw data file.
>

We don't actually use array designs at this point, so I'm not sure how
to address this. Faking it sounds fine to me. However, as a side note,
the reason there is no spot info is that for Illumina, each array on
each chip is different! The scanning software reads in a set of files
which contain the array designs, and spits out the "gene_profile.csv"
file, which is actually the data AVERAGED over all the beads for each
probe. So, if someone REALLY wanted to get into the deeper level of
analysis (bead-level), they would have to upload some additional files
(which I've never dealt with). Thus, I recommend not dealing with that
layer just yet.

> I am also thinking of the possibility of using the plug-in from the
> Experiment view page if the experiment is of the 'illumina' data type.
> Then, the raw bioassays created by the import could be assigned to the
> experiment by the plug-in, saving yet another manual step.
>

That seems cool.
Would it then be easy to extend this feature to all data types?

> >
> > Thanks for making this plugin!
>
> Well... it is not implemented yet...
>
> >> The files will be put in a protected repository that is only
> >> available to the core developers.
>
> Since you uploaded the files to our Trac I assume that you are not
> worried about other users seeing them. Is it OK to use some of the files
> in our regular test programs? They will not be included in the binary
> distribution, only in the source distribution and of course from direct
> subversion access.

Yes, you can use those data in your test files.

>
> /Nicklas

Jeremy

_______________________________________________
The BASE general discussion mailing list
basedb-users@lists.sourceforge.net
unsubscribe: send a mail with subject "unsubscribe" to [EMAIL PROTECTED]