Re: [base] some base 2 bugs/features (raw data and array design import)

Bob MacCallum Tue, 24 Apr 2007 04:04:25 -0700

Hi Nicklas,

Thanks for the reply.


Nicklas Nordborg writes:
 > > 2. I think there's some inconsistent handling of trailing spaces in the
 > >    "reporter ID" column of a genepix .gpr file.  For example I can import
 > >    reporters, and create an array design from the file pasted below, but I
 > >    can't then import the raw data!
[snip]
 > Leading and trailing blanks are trimmed from more or less all values 
 > before they are inserted in the database and that explains why you get
 > "demorep2" instead of "demorep2 ". I guess we never thought of doing the
 > same when checking if a reporter (or something else with a unique value) 
 > exists in the database or not. I think there are several other places 
 > affected by the same thing. I'll add this as a bug in our trac database. 
 > In the meantime you can try using a splitter regexp that also removes 
 > white-space. Try something like \s*\t\s* instead of just \t. I have not 
 > tested this but it might be enough to make it work.
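
To illustrate the mismatch described above (values trimmed on insert but not
on lookup), here is a minimal Python sketch. It is not BASE's code: the class
ReporterStore and its methods are hypothetical names, and a plain set stands
in for the database table.

```python
def normalize(value: str) -> str:
    """Trim leading and trailing blanks, as is done before insertion."""
    return value.strip()


class ReporterStore:
    """Toy stand-in for the reporter table (hypothetical, not BASE's API)."""

    def __init__(self) -> None:
        self._ids: set[str] = set()

    def store(self, external_id: str) -> None:
        # Values are trimmed before they are stored.
        self._ids.add(normalize(external_id))

    def exists_untrimmed(self, external_id: str) -> bool:
        # Lookup with the raw value: "demorep2 " never matches "demorep2".
        return external_id in self._ids

    def exists(self, external_id: str) -> bool:
        # Lookup with the same normalization that was used on insert.
        return normalize(external_id) in self._ids


store = ReporterStore()
store.store("demorep2 ")
found_raw = store.exists_untrimmed("demorep2 ")
found_trimmed = store.exists("demorep2 ")
```

With the trailing space in the file, found_raw is False and found_trimmed is
True, which is the kind of inconsistency reported here.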

I guessed there would be a neat trick like this, but couldn't think of it last
night.

However, I tried \s*\t\s*
and "?\s*\t\s*"? (which also needs Block = \"Block\ and Flags = \Flags"\)

and, oddly, both give the same error as before:
Error: Item not found: Reporter mismatch: The feature has reporter 'demorep2'
whereas you have given 'demorep2 ' on line 6: 1 2 1 "demoB" "de...
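
One possible explanation, offered as an untested guess: if the trailing space
sits inside the quotes (as in "demorep2 "), neither splitter can remove it,
because the space is separated from the tab by the closing quote. The sketch
below uses Python's re module on made-up lines; BASE uses Java regular
expressions, which I assume behave the same way for these simple patterns.

```python
import re


def split_fields(line: str, splitter: str) -> list[str]:
    """Split one data line on the given splitter regexp."""
    return re.split(splitter, line)


# Made-up sample lines, not taken from the real GPR file.
unquoted = 'demoB\tdemorep2 \t10'
quoted = '"demoB"\t"demorep2 "\t10'

# Space directly before the tab: the splitter swallows it.
plain = split_fields(unquoted, r'\s*\t\s*')

# Space inside the quotes: it survives both splitters.
kept_quotes = split_fields(quoted, r'\s*\t\s*')
stripped_quotes = split_fields(quoted, r'"?\s*\t\s*"?')
```

Here plain is ['demoB', 'demorep2', '10'], but kept_quotes still contains
'"demorep2 "' and stripped_quotes still contains 'demorep2 ' with its
trailing space.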

I don't have time to look into the code today I'm afraid.


If I remove the trailing space from the GPR files I can import them against an
array design imported from BASE1 which was created using a buggy file (with
trailing spaces).  (I have to fix the case issues too - see below.)

So I'm happy.

 > > 
 > > 3. case sensitivity in the reporter ID (external id) column
 > > 
 > >   I get "Error: Duplicate entry 'demoBLANK' for key 2"
 > >   if I import reporters from this file:
[snip]
 > This problem is affected by how the database handles strings. MySQL is
 > case-insensitive. Postgres on the other hand is case-sensitive and the 
 > same problem would never have appeared. The important question is if the 
 > "demoblank" and "demoBLANK" should be treated as the same reporters or not?
 > 
 > In Postgres they are already treated as different and it would be rather 
 > hard to change that. The only way is to convert all ID:s to the same 
 > case before storing them in the database.
 > 
 > In MySQL they are treated as the same and it is equally hard to change 
 > that. The problem appears here because the two reporters are in the same 
 > file. If there had been two different raw data files, both "demoblank" 
 > and "demoBLANK" would have mapped to the same reporter. The bug in our 
 > code is that when the lines are in the same file we do case-sensitive 
 > comparison to check what has already been inserted. I'll add a ticket 
 > for that as well.
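
As a rough sketch of the same-file check described above, the snippet below
flags IDs that differ only by case before they reach a case-insensitive
database. The function name find_case_collisions is hypothetical, and
str.casefold() is only an approximation of what a MySQL collation actually
treats as equal.

```python
def find_case_collisions(ids: list[str]) -> list[tuple[str, str]]:
    """Return (first_seen, later) pairs of IDs that differ only by case."""
    seen: dict[str, str] = {}
    collisions: list[tuple[str, str]] = []
    for ext_id in ids:
        key = ext_id.casefold()  # case-insensitive comparison key
        first = seen.setdefault(key, ext_id)
        if first != ext_id:
            # Same key, different spelling: would clash in the database.
            collisions.append((first, ext_id))
    return collisions


collisions = find_case_collisions(["demoblank", "demosoap", "demoBLANK"])
```

For this list, collisions is [('demoblank', 'demoBLANK')]; exact duplicates
are not reported, only spellings that differ by case.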

thanks!

 > 
 > > Tomorrow I'll see how far I get with fixing the input files.  Ideally I
 > > want to be able to continue to import raw data into BASE2 linked to array
 > > designs that were migrated from BASE1.  We have to fix all kinds of stuff
 > > in the files anyway so I don't think that should be too much of a
 > > problem.

for now I've fixed the input files (BLANK -> Blank and soap -> Soap)
and all is well.

 > > 5. There doesn't seem to be a record of which file was used to import
 > > features into an array design (this has been discussed on the list
 > > recently I think).
 > 
 > Yes, it was discussed for raw data import. The problem is that raw data 
 > doesn't always come in one file. Some platforms generate two files and 
 > some generate one file for multiple hybridizations. Maybe the same 
 > problem doesn't exist for features and there always is a one-to-one 
 > relationship to a file?

to my knowledge array designs are usually in one file, but I've only worked
with GAL and ADF files and two platforms so far.

 > Thanks for the test data. If I have time I will cut out your lines and 
 > include them in the test programs. We have not done much testing with 
 > non-perfect data and I guess there is more to find before the problems 
 > disappear.

Haha, all we have is non-perfect data ;-)


I just found a minor bug with the web interface while importing raw data:
http://base.thep.lu.se/ticket/576
(don't remember that being fixed for 2.2.3)

Time to balance the negative with some positive...  BASE2 is so much nicer to
work with than BASE1, keep up the great work guys!!

Now I'll finally have a look at Micha's bulk importer...

cheers,
Bob.


-- 
Bob MacCallum | VectorBase Developer | Kafatos/Christophides Groups |
Division of Cell and Molecular Biology | Imperial College London |
Phone +442075941945 | Email [EMAIL PROTECTED]
