On Thu, Apr 7, 2011 at 7:00 PM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> On Thu, Apr 7, 2011 at 4:28 PM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
>> Hi all,
>> I have just found a problem using the "Filter data on any column using
>> simple expressions" tool, i.e. files tools/stats/filters.xml and
>> tools/stats/filters.py
>> I have a six-column tabular file like this, where I have used \t for a
>> tab and \n for the new lines:
>> #ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n
>> gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n
>> gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n
>> gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n
>> Breakdown of my data:
>> Column 1 - ID, mandatory string
>> Column 2 - HMM_Sprob_score, mandatory float
>> Column 3 - SP_len, mandatory integer
>> Column 4 - RXLR_start, optional integer
>> Column 5 - EER_start, optional integer
>> Column 6 - RXLR?, mandatory string (Y or N)
>> Notice that in my output columns 4 and 5 can be empty or an integer.
>> I'm trying to filter this file using c6=='Y', i.e. column six is a
>> yes. This works (one row output) but Galaxy tells me:
>> Info: Filtering with c6=='Y',
>> kept 100.00% of 4 lines.
>> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
>> SP_len RXLR_start EER_start RXLR?"
>> Then if I try to filter using c6=='N', i.e. column six is a no, it
>> fails to work (zero rows of output instead of three) and tells me:
>> kept 0.00% of 4 lines.
>> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
>> SP_len RXLR_start EER_start RXLR?"
>> Digging into the code, tools/stats/filters.py gets given the list of
>> column types from Galaxy and (regardless of which columns are to be
>> used) attempts to cast them to integers, floats, etc.
>> It looks like Galaxy has decided that my columns 4 and 5 are integers
>> (based on the first row), and therefore filters.py blindly tries to
>> use int(...) on all these entries, which fails on the empty cells.
>> I see several issues:
>> (a) The filters.py tool only really needs to cast those columns being
>> used for the filter (fairly easy to fix)
>> (b) The Galaxy column type detection seems a bit fragile (hard to
>> really fix without looking at all the data).
>> (c) Are there other tools that would break in a similar way to filters.py?
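
To illustrate (a), here is a minimal sketch of the failure and of casting
only the columns the expression refers to. This is not the actual
filters.py code: the names CASTS, cast_all and cast_used_columns are
hypothetical, and I am assuming the column types arrive as a list of
strings like "int", "float" and "str".

```python
import re

# Assumed mapping from column type names to Python casts.
CASTS = {"int": int, "float": float, "str": str}


def cast_all(fields, column_types):
    """Cast every column, used or not (the behaviour described above)."""
    return [CASTS[t](f) for f, t in zip(fields, column_types)]


def cast_used_columns(fields, column_types, expression):
    """Cast only the columns named in the expression (c1, c2, ...)."""
    # Collect the 1-based column numbers that appear as cN in the expression.
    used = {int(n) for n in re.findall(r"\bc(\d+)\b", expression)}
    values = {}
    for number in sorted(used):
        cast = CASTS[column_types[number - 1]]
        values["c%d" % number] = cast(fields[number - 1])
    return values


column_types = ["str", "float", "int", "int", "int", "str"]
row = "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN".split("\t")

try:
    cast_all(row, column_types)
except ValueError as err:
    # int("") raises ValueError for the empty cells in columns 4 and 5.
    print("casting every column failed:", err)

# Only column 6 is cast, so the empty integer cells are never touched.
print(cast_used_columns(row, column_types, "c6=='N'"))
```

With this approach a row is only rejected when a column that the filter
actually uses cannot be cast.
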
> Also:
> (d) This probably also explains why the filter tool doesn't like my header
> row (which starts with a #) since the captions are not numeric. Skipping
> these is probably a different bug fix though.
> Peter
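
For (d), one possible approach is to separate the # header lines from
the data before any casting happens. The sketch below is only an
illustration under that assumption; split_header_and_data is a
hypothetical helper, not something taken from filters.py.

```python
def split_header_and_data(lines, comment_char="#"):
    """Separate comment/header lines from tab-separated data rows."""
    headers, data = [], []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(comment_char):
            # Header or comment line: keep it aside, never try to cast it.
            headers.append(line)
        elif line:
            data.append(line.split("\t"))
    return headers, data


lines = [
    "#ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n",
    "gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n",
    "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n",
    "gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n",
]
headers, data = split_header_and_data(lines)
print(len(headers), "header line,", len(data), "data rows")
```

That way the header would not be counted as an invalid line.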

To address these issues with the filters.py tool I've filed the
following bugs with fixes:


