Bob MacCallum wrote:
> Hi,
>
> I just spent the afternoon getting to know the array design and raw data
> import into BASE2 - starting with genepix format - and have come across a few
> things.
>
> I'm using BASE 2.2.2 (build #3172; schema #30). I have looked through the
> fixes for 2.2.3 and decided not to upgrade - otherwise I'll just spend my
> whole life upgrading BASE... ;-)
>
> 1. In the "view files" page, the "type" menu has a blank entry for 'raw data'
> although this still seems to work. This might be fixed in 2.2.3
> see http://base.thep.lu.se/ticket/559 which looks related.
Yes, I think this is the same thing.

> 2. I think there's some inconsistent handling of trailing spaces in the
> "reporter ID" column of a genepix .gpr file. For example I can import
> reporters, and create an array design from the file pasted below, but I
> can't then import the raw data!
>
> (the following is just 8 lines long - if the long lines get mangled, I'll send
> a copy by mail on request)
>
> ATF 1.0
> 27 43
> Type=GenePix Results 1.4
> "Block" "Column" "Row" "Name" "ID" "X" "Y" "Dia."
> "F635 Median" "F635 Mean" "F635 SD" "B635 Median" "B635 Mean"
> "B635 SD" "% > B635+1SD" "% > B635+2SD" "F635 % Sat." "F532
> Median" "F532 Mean" "F532 SD" "B532 Median" "B532 Mean"
> "B532 SD" "% > B532+1SD" "% > B532+2SD" "F532 % Sat." "Ratio of
> Medians" "Ratio of Means" "Median of Ratios" "Mean of
> Ratios" "Ratios SD" "Rgn Ratio" "Rgn R²" "F Pixels"
> "B Pixels" "Sum of Medians" "Sum of Means" "Log Ratio"
> "F635 Median - B635" "F532 Median - B532" "F635 Mean - B635" "F532
> Mean - B532" "Flags"
> 1 1 1 "demoA" "demorep1" 1690 5730 110 183
> 181 42 59 62 25 100 98 0 276 270
> 48 64 65 13 100 100 0 0.585 0.592
> 0.570 0.576 1.357 0.591 0.782 80 621 336 328
> -0.774 124 212 122 206 0
> 1 2 1 "demoB" "demorep2 " 1910 5730 120 114
> 137 175 57 61 37 71 21 0 346 341
> 80 63 65 35 96 95 0 0.201 0.288
> 0.192 0.209 2.379 0.398 0.094 120 716 340 358
> -2.312 57 283 80 278 0
> 1 3 1 "demoC" "demorep3" 2110 5740 110 145
> 148 43 63 68 30 92 68 0 208 214
> 48 69 74 43 98 93 0 0.590 0.586
> 0.599 0.541 1.987 0.504 0.582 80 566 221 230
> -0.761 82 139 85 145 0
> 1 4 1 "demoD" "demorep4" 2300 5730 110 185
> 187 51 59 63 23 100 96 0 298 294
> 57 64 67 24 100 98 0 0.538 0.557
> 0.526 0.538 1.599 0.549 0.730 80 590 360 358
> -0.893 126 234 128 230 0
>
>
> the stacktrace from the raw data import is:
>
> net.sf.basedb.core.BaseException: Item not found: Reporter mismatch: The
> feature has reporter 'demorep2' whereas you have given 'demorep2 ' on line 6:
> 1 2 1 "demoB" "de...
> at
> net.sf.basedb.plugins.AbstractFlatFileImporter.doImport(AbstractFlatFileImporter.java:592)
> at
> net.sf.basedb.plugins.AbstractFlatFileImporter.run(AbstractFlatFileImporter.java:442)
> at
> net.sf.basedb.core.PluginExecutionRequest.invoke(PluginExecutionRequest.java:88)
> at
> net.sf.basedb.core.InternalJobQueue$JobRunner.run(InternalJobQueue.java:420)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: net.sf.basedb.core.ItemNotFoundException: Item not found: Reporter
> mismatch: The feature has reporter 'demorep2' whereas you have given
> 'demorep2 '
> at net.sf.basedb.core.RawDataBatcher.doInsert(RawDataBatcher.java:390)
> at net.sf.basedb.core.RawDataBatcher.insert(RawDataBatcher.java:343)
> at
> net.sf.basedb.plugins.RawDataFlatFileImporter.handleData(RawDataFlatFileImporter.java:544)
> at
> net.sf.basedb.plugins.AbstractFlatFileImporter.doImport(AbstractFlatFileImporter.java:570)
> ... 4 more
>
>
> I think BASE1 was more tolerant.

Leading and trailing blanks are trimmed from more or less all values before they are inserted in the database, and that explains why you get "demorep2" instead of "demorep2 ". I guess we never thought of doing the same when checking whether a reporter (or something else with a unique value) exists in the database or not. I think there are several other places affected by the same thing. I'll add this as a bug in our trac database.

In the meantime you can try using a splitter regexp that also removes white-space. Try something like \s*\t\s* instead of just \t. I have not tested this but it might be enough to make it work.

>
> 3. case sensitivity in the reporter ID (external id) column
>
> I get "Error: Duplicate entry 'demoBLANK' for key 2"
> if I import reporters from this file:
>
> ATF 1.0
> 27 43
> Type=GenePix Results 1.4
> "Block" "Column" "Row" "Name" "ID" "X" "Y" "Dia."
> "F635 Median" "F635 Mean" "F635 SD" "B635 Median" "B635 Mean"
> "B635 SD" "% > B635+1SD" "% > B635+2SD" "F635 % Sat." "F532
> Median" "F532 Mean" "F532 SD" "B532 Median" "B532 Mean"
> "B532 SD" "% > B532+1SD" "% > B532+2SD" "F532 % Sat." "Ratio of
> Medians" "Ratio of Means" "Median of Ratios" "Mean of
> Ratios" "Ratios SD" "Rgn Ratio" "Rgn R²" "F Pixels"
> "B Pixels" "Sum of Medians" "Sum of Means" "Log Ratio"
> "F635 Median - B635" "F532 Median - B532" "F635 Mean - B635" "F532
> Mean - B532" "Flags"
> 1 1 1 "demoA" "demorep1" 1690 5730 110 183
> 181 42 59 62 25 100 98 0 276 270
> 48 64 65 13 100 100 0 0.585 0.592
> 0.570 0.576 1.357 0.591 0.782 80 621 336 328
> -0.774 124 212 122 206 0
> 1 2 1 "demoB" "demorep2" 1910 5730 120 114
> 137 175 57 61 37 71 21 0 346 341
> 80 63 65 35 96 95 0 0.201 0.288
> 0.192 0.209 2.379 0.398 0.094 120 716 340 358
> -2.312 57 283 80 278 0
> 1 3 1 "demoblank" "demoblank" 2110 5740 110
> 145 148 43 63 68 30 92 68 0 208
> 214 48 69 74 43 98 93 0 0.590
> 0.586 0.599 0.541 1.987 0.504 0.582 80 566 221 230
> -0.761 82 139 85 145 0
> 1 4 1 "demoBLANK" "demoBLANK" 2300 5730 110
> 185 187 51 59 63 23 100 96 0 298
> 294 57 64 67 24 100 98 0 0.538
> 0.557 0.526 0.538 1.599 0.549 0.730 80 590 360 358
> -0.893 126 234 128 230 0
>
>
> However in BASE1 it was possible to import files with problems like this.
> For example, see
>
> http://base.vectorbase.org/raw_edit.phtml?i_r=102 (just click ok to log in)
>
> you can compare the imported .gpr file (scroll down to
> 2 17 20) with the table of data (position 857)
>
> you see that "BLANK" was imported as "Blank" because "Blank" was already in
> the
> table.
>

This problem is caused by how the database handles strings. MySQL is case-insensitive. Postgres, on the other hand, is case-sensitive, and the same problem would never have appeared there. The important question is whether "demoblank" and "demoBLANK" should be treated as the same reporter or not. In Postgres they are already treated as different and it would be rather hard to change that.
The only way is to convert all IDs to the same case before storing them in the database. In MySQL they are treated as the same and it is equally hard to change that.

The problem appears here because the two reporters are in the same file. If there had been two different raw data files, both "demoblank" and "demoBLANK" would have mapped to the same reporter. The bug in our code is that when the lines are in the same file we do a case-sensitive comparison to check what has already been inserted. I'll add a ticket for that as well.

> Tomorrow I'll see how far I get with fixing the input files. Ideally I want
> to be able to continue to import raw data into BASE2 linked to array designs
> that were migrated from BASE1. We have to fix all kinds of stuff in the files
> anyway so I don't think that should be too much of a problem.
>
>
> 4. It doesn't seem possible to "un-import" raw data (in order to
> reimport it after fixing some annoying typo). The same
> seems to be true of array designs (can't reimport reporter maps).

No, it's not. There is functionality for it in the core but it hasn't made it into the web interface yet.

> 5. There doesn't seem to a record of which file was used to import features
> into an array design (this has been discussed on the list recently I
> think).

Yes, it was discussed for raw data import. The problem is that raw data doesn't always come in one file. Some platforms generate two files and some generate one file for multiple hybridizations. Maybe the same problem doesn't exist for features and there is always a one-to-one relationship to a file?

Thanks for the test data. If I have time I will cut out your lines and include them in the test programs. We have not done much testing with non-perfect data and I guess there is more to find before the problems disappear.
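To make the two points above concrete, here is a minimal sketch. It is not BASE code: the class ReporterIdDemo and the helpers clean, key and register are hypothetical names for illustration only. It assumes that quotes are stripped after the line has been split, and it lower-cases IDs to imitate a case-insensitive comparison. It also shows one thing to watch for with the \s*\t\s* splitter: a blank that sits inside the quotes, as it appears to in the pasted line, is not next to the tab, so the value probably still needs a trim after the quotes are removed.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class ReporterIdDemo {

    // Splitter from the reply above: a tab plus any white-space around it.
    static final Pattern SPLITTER = Pattern.compile("\\s*\\t\\s*");

    // Hypothetical helper: remove one pair of surrounding double quotes,
    // then trim blanks that were inside the quotes.
    static String clean(String value) {
        String v = value.trim();
        if (v.length() >= 2 && v.startsWith("\"") && v.endsWith("\"")) {
            v = v.substring(1, v.length() - 1);
        }
        return v.trim();
    }

    // Hypothetical helper: comparison key that ignores case and outer blanks.
    static String key(String id) {
        return clean(id).toLowerCase(Locale.ROOT);
    }

    // Returns the earlier ID that collides with 'id', or null if 'id' is new.
    static String register(Map<String, String> seen, String id) {
        return seen.putIfAbsent(key(id), clean(id));
    }

    public static void main(String[] args) {
        // Tab-separated excerpt of the second data line from item 2.
        String line = "1\t2\t1\t\"demoB\"\t\"demorep2 \"\t1910";
        String[] cols = SPLITTER.split(line);
        System.out.println("[" + cols[4] + "]");        // quotes and inner blank remain
        System.out.println("[" + clean(cols[4]) + "]"); // quotes and blank removed

        // In-file duplicate check that treats IDs differing only in case as equal.
        Map<String, String> seen = new LinkedHashMap<>();
        System.out.println(register(seen, "demoblank")); // null: first occurrence
        System.out.println(register(seen, "demoBLANK")); // collides with demoblank
    }
}
```

With a check like this, the second ID can be mapped to the reporter that is already registered instead of being inserted again, which is the behaviour described above for two separate files.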
/Nicklas