Re: [base] some base 2 bugs/features (raw data and array design import)

Nicklas Nordborg Mon, 23 Apr 2007 11:50:35 -0700

Bob MacCallum wrote:
> Hi,
> 
> I just spent the afternoon getting to know the array design and raw data
> import into BASE2 - starting with genepix format - and have come across a few
> things.
> 
> I'm using BASE 2.2.2 (build #3172; schema #30).  I have looked through the
> fixes for 2.2.3 and decided not to upgrade - otherwise I'll just spend my
> whole life upgrading BASE... ;-)
> 
> 1. In the "view files" page, the "type" menu has a blank entry for 'raw data'
>    although this still seems to work.  This might be fixed in 2.2.3
>    see http://base.thep.lu.se/ticket/559 which looks related.


Yes, I think this is the same thing.


> 2. I think there's some inconsistent handling of trailing spaces in the
>    "reporter ID" column of a genepix .gpr file.  For example I can import
>    reporters, and create an array design from the file pasted below, but I
>    can't then import the raw data!
> 
> (the following is just 8 lines long - if the long lines get mangled, I'll send
> a copy by mail on request)
> 
> ATF   1.0
> 27    43    
> Type=GenePix Results 1.4
> "Block"       "Column"        "Row"   "Name"  "ID"    "X"     "Y"     "Dia."  
> "F635 Median"   "F635 Mean"     "F635 SD"       "B635 Median"   "B635 Mean"   
>   "B635 SD"       "% > B635+1SD"  "% > B635+2SD"  "F635 % Sat."   "F532 
> Median"   "F532 Mean"     "F532 SD"       "B532 Median"   "B532 Mean"     
> "B532 SD"       "% > B532+1SD"  "% > B532+2SD"  "F532 % Sat."   "Ratio of 
> Medians"      "Ratio of Means"        "Median of Ratios"      "Mean of 
> Ratios"        "Ratios SD"     "Rgn Ratio"     "Rgn R²"        "F Pixels"     
>  "B Pixels"      "Sum of Medians"        "Sum of Means"  "Log Ratio"     
> "F635 Median - B635"    "F532 Median - B532"    "F635 Mean - B635"      "F532 
> Mean - B532"      "Flags"
> 1     1       1       "demoA" "demorep1"      1690    5730    110     183     
> 181     42      59      62      25      100     98      0       276     270   
>   48      64      65      13      100     100     0       0.585   0.592   
> 0.570   0.576   1.357   0.591   0.782   80      621     336     328     
> -0.774  124     212     122     206     0
> 1     2       1       "demoB" "demorep2 "     1910    5730    120     114     
> 137     175     57      61      37      71      21      0       346     341   
>   80      63      65      35      96      95      0       0.201   0.288   
> 0.192   0.209   2.379   0.398   0.094   120     716     340     358     
> -2.312  57      283     80      278     0
> 1     3       1       "demoC" "demorep3"      2110    5740    110     145     
> 148     43      63      68      30      92      68      0       208     214   
>   48      69      74      43      98      93      0       0.590   0.586   
> 0.599   0.541   1.987   0.504   0.582   80      566     221     230     
> -0.761  82      139     85      145     0
> 1     4       1       "demoD" "demorep4"      2300    5730    110     185     
> 187     51      59      63      23      100     96      0       298     294   
>   57      64      67      24      100     98      0       0.538   0.557   
> 0.526   0.538   1.599   0.549   0.730   80      590     360     358     
> -0.893  126     234     128     230     0
> 
> 
> the stacktrace from the raw data import is:
> 
> net.sf.basedb.core.BaseException: Item not found: Reporter mismatch: The feature has reporter 'demorep2' whereas you have given 'demorep2 ' on line 6: 1 2 1 "demoB" "de...
> at net.sf.basedb.plugins.AbstractFlatFileImporter.doImport(AbstractFlatFileImporter.java:592)
> at net.sf.basedb.plugins.AbstractFlatFileImporter.run(AbstractFlatFileImporter.java:442)
> at net.sf.basedb.core.PluginExecutionRequest.invoke(PluginExecutionRequest.java:88)
> at net.sf.basedb.core.InternalJobQueue$JobRunner.run(InternalJobQueue.java:420)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: net.sf.basedb.core.ItemNotFoundException: Item not found: Reporter mismatch: The feature has reporter 'demorep2' whereas you have given 'demorep2 '
> at net.sf.basedb.core.RawDataBatcher.doInsert(RawDataBatcher.java:390)
> at net.sf.basedb.core.RawDataBatcher.insert(RawDataBatcher.java:343)
> at net.sf.basedb.plugins.RawDataFlatFileImporter.handleData(RawDataFlatFileImporter.java:544)
> at net.sf.basedb.plugins.AbstractFlatFileImporter.doImport(AbstractFlatFileImporter.java:570)
> ... 4 more
> 
> 
> I think BASE1 was more tolerant.

Leading and trailing blanks are trimmed from more or less all values 
before they are inserted in the database, and that explains why you get 
"demorep2" instead of "demorep2 ". I guess we never thought of doing the 
same when checking if a reporter (or something else with a unique value) 
exists in the database or not. I think there are several other places 
affected by the same thing. I'll add this as a bug in our trac database. 
In the meantime you can try using a splitter regexp that also removes 
white-space. Try something like \s*\t\s* instead of just \t. I have not 
tested this but it might be enough to make it work.
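
To illustrate what the two splitter regexps do to a line, here is a 
minimal sketch. It assumes the regexp is applied with Java's standard 
java.util.regex classes; SplitterDemo and its members are hypothetical 
names for this example, not BASE classes, and the sample line is 
shortened and unquoted.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of BASE.
class SplitterDemo {
    // Plain tab splitter: blanks around a value stay part of the value.
    static final Pattern TAB = Pattern.compile("\\t");
    // Suggested splitter: also swallows white-space on both sides of the tab.
    static final Pattern TAB_TRIM = Pattern.compile("\\s*\\t\\s*");

    // Split one data line into columns; limit -1 keeps trailing empty columns.
    static String[] split(Pattern splitter, String line) {
        return splitter.split(line, -1);
    }

    public static void main(String[] args) {
        // Unquoted value with a trailing blank before the tab.
        String line = "1\t2\t1\tdemoB\tdemorep2 \t1910";
        System.out.println(Arrays.toString(split(TAB, line)));
        System.out.println(Arrays.toString(split(TAB_TRIM, line)));

        // Quoted value: the blank sits inside the quotes, not next to the tab.
        String quoted = "\"demoB\"\t\"demorep2 \"\t1910";
        System.out.println(Arrays.toString(split(TAB_TRIM, quoted)));
    }
}
```

Two caveats follow from how the regexp matches. In the sample file the 
blank is inside the quotes, so the splitter alone does not remove it; 
whether the workaround helps depends on when quotes are stripped, which 
this sketch does not model. Also, \s matches a tab, so two consecutive 
tabs (an empty column) are consumed as one separator. If empty columns 
can occur, a pattern restricted to spaces, such as " *\t *", avoids 
that.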

> 
> 3. case sensitivity in the reporter ID (external id) column
> 
>   I get "Error: Duplicate entry 'demoBLANK' for key 2"
>   if I import reporters from this file:
> 
> ATF   1.0
> 27    43    
> Type=GenePix Results 1.4
> "Block"       "Column"        "Row"   "Name"  "ID"    "X"     "Y"     "Dia."  
> "F635 Median"   "F635 Mean"     "F635 SD"       "B635 Median"   "B635 Mean"   
>   "B635 SD"       "% > B635+1SD"  "% > B635+2SD"  "F635 % Sat."   "F532 
> Median"   "F532 Mean"     "F532 SD"       "B532 Median"   "B532 Mean"     
> "B532 SD"       "% > B532+1SD"  "% > B532+2SD"  "F532 % Sat."   "Ratio of 
> Medians"      "Ratio of Means"        "Median of Ratios"      "Mean of 
> Ratios"        "Ratios SD"     "Rgn Ratio"     "Rgn R²"        "F Pixels"     
>  "B Pixels"      "Sum of Medians"        "Sum of Means"  "Log Ratio"     
> "F635 Median - B635"    "F532 Median - B532"    "F635 Mean - B635"      "F532 
> Mean - B532"      "Flags"
> 1     1       1       "demoA" "demorep1"      1690    5730    110     183     
> 181     42      59      62      25      100     98      0       276     270   
>   48      64      65      13      100     100     0       0.585   0.592   
> 0.570   0.576   1.357   0.591   0.782   80      621     336     328     
> -0.774  124     212     122     206     0
> 1     2       1       "demoB" "demorep2"      1910    5730    120     114     
> 137     175     57      61      37      71      21      0       346     341   
>   80      63      65      35      96      95      0       0.201   0.288   
> 0.192   0.209   2.379   0.398   0.094   120     716     340     358     
> -2.312  57      283     80      278     0
> 1     3       1       "demoblank"     "demoblank"     2110    5740    110     
> 145     148     43      63      68      30      92      68      0       208   
>   214     48      69      74      43      98      93      0       0.590   
> 0.586   0.599   0.541   1.987   0.504   0.582   80      566     221     230   
>   -0.761  82      139     85      145     0
> 1     4       1       "demoBLANK"     "demoBLANK"     2300    5730    110     
> 185     187     51      59      63      23      100     96      0       298   
>   294     57      64      67      24      100     98      0       0.538   
> 0.557   0.526   0.538   1.599   0.549   0.730   80      590     360     358   
>   -0.893  126     234     128     230     0
> 
> 
> However in BASE1 it was possible to import files with problems like this.
> For example, see
> 
> http://base.vectorbase.org/raw_edit.phtml?i_r=102 (just click ok to log in)
> 
> you can compare the imported .gpr file (scroll down to
> 2     17      20) with the table of data (position 857)
> 
> you see that "BLANK" was imported as "Blank" because "Blank" was already
> in the table.
> 

This problem is caused by how the database handles strings. MySQL 
compares strings case-insensitively by default. Postgres, on the other 
hand, is case-sensitive and the same problem would never have appeared 
there. The important question is whether "demoblank" and "demoBLANK" 
should be treated as the same reporter or not.

In Postgres they are already treated as different and it would be rather 
hard to change that. The only way is to convert all IDs to the same 
case before storing them in the database.

In MySQL they are treated as the same and it is equally hard to change 
that. The problem appears here because the two reporters are in the same 
file. If there had been two different raw data files, both "demoblank" 
and "demoBLANK" would have mapped to the same reporter. The bug in our 
code is that when the lines are in the same file we do a case-sensitive 
comparison to check what has already been inserted. I'll add a ticket 
for that as well.
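
As a rough sketch of the kind of in-file check that would agree with a 
case-insensitive database, the bookkeeping could fold case before 
looking up what has already been seen. SeenReporters is a hypothetical 
helper, not BASE code, and lower-casing with Locale.ROOT is only an 
approximation of what a database collation does.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of BASE: remembers the external IDs seen so
// far in one file and folds case the way a case-insensitive database would.
class SeenReporters {
    // lower-cased ID -> spelling used the first time the ID appeared
    private final Map<String, String> firstSpelling = new HashMap<>();

    // Returns the spelling to use for this ID: the first spelling seen that
    // differs at most in case, or the ID itself if it is new.
    String resolve(String externalId) {
        String key = externalId.toLowerCase(Locale.ROOT);
        String previous = firstSpelling.putIfAbsent(key, externalId);
        return previous != null ? previous : externalId;
    }

    public static void main(String[] args) {
        SeenReporters seen = new SeenReporters();
        String[] ids = {"demorep1", "demorep2", "demoblank", "demoBLANK"};
        for (String id : ids) {
            System.out.println(id + " -> " + seen.resolve(id));
        }
    }
}
```

With the IDs from the test file, "demoBLANK" resolves to the earlier 
"demoblank", so only one reporter would be created instead of a second 
insert that the database rejects as a duplicate.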


> Tomorrow I'll see how far I get with fixing the input files.  Ideally I want
> to be able to continue to import raw data into BASE2 linked to array designs
> that were migrated from BASE1.  We have to fix all kinds of stuff in the files
> anyway so I don't think that should be too much of a problem.
> 
> 
> 4. It doesn't seem possible to "un-import" raw data (in order to
>    reimport it after fixing some annoying typo).  The same
>    seems to be true of array designs (can't reimport reporter maps).

No, it's not possible yet. There is functionality for it in the core but 
it hasn't made it to the web interface yet.

> 5. There doesn't seem to be a record of which file was used to import features
>    into an array design (this has been discussed on the list recently I
>    think).

Yes, it was discussed for raw data import. The problem is that raw data 
doesn't always come in one file. Some platforms generate two files and 
some generate one file for multiple hybridizations. Maybe the same 
problem doesn't exist for features and there always is a one-to-one 
relationship to a file?

Thanks for the test data. If I have time I will cut out your lines and 
include them in the test programs. We have not done much testing with 
non-perfect data and I guess there is more to find before the problems 
disappear.

/Nicklas
