On Thu, Apr 7, 2011 at 4:28 PM, Peter Cock <p.j.a.c...@googlemail.com> wrote:
> Hi all,
> I have just found a problem using the "Filter data on any column using
> simple expressions" tool, i.e. files tools/stats/filters.xml and
> tools/stats/filters.py
> I have some six-column tabular data like this, where I have used \t for a
> tab and \n for the newlines:
> #ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n
> gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n
> gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n
> gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n
> Breakdown of my data:
> Column 1 - ID, mandatory string
> Column 2 - HMM_Sprob_score, mandatory float
> Column 3 - SP_len, mandatory integer
> Column 4 - RXLR_start, optional integer
> Column 5 - EER_start, optional integer
> Column 6 - RXLR?, mandatory string (Y or N)
> Notice that in my output columns 4 and 5 can be empty or an integer.
> I'm trying to filter this file using c6=='Y', i.e. column six is a
> yes. This works (one row output) but Galaxy tells me:
> Info: Filtering with c6=='Y',
> kept 100.00% of 4 lines.
> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
> SP_len RXLR_start EER_start RXLR?"
> Then if I try to filter using c6=='N', i.e. column six is a no, it
> fails to work (zero rows of output instead of three) and tells me:
> kept 0.00% of 4 lines.
> Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score
> SP_len RXLR_start EER_start RXLR?"
> Digging into the code, tools/stats/filters.py gets given the list of
> column types from Galaxy and (regardless of which columns are to be
> used) attempts to cast them to integers, floats, etc.
> It looks like Galaxy has decided that my columns 4 and 5 are integers
> (based on the first row), and therefore filters.py blindly tries to
> use int(...) on all these entries, which fails on the empty cells.
> I see several issues:
> (a) The filters.py tool only really needs to cast those columns being
> used for the filter (fairly easy to fix)
> (b) The galaxy column type detection seems a bit fragile (hard to
> really fix without looking at all the data).
> (c) Are there other tools that would break in a similar way to filters.py?

(d) This probably also explains why the filter tool doesn't like my header
row (which starts with a #): the column captions cannot be cast to numbers,
so the line is counted as invalid. Skipping such header lines is probably a
separate bug fix though.
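
As a rough illustration of (a) and (d), here is a minimal sketch. It is not
the actual filters.py code: the function name filter_rows, its arguments, the
"int"/"float"/"str" type names and the use of eval are all assumptions made
for brevity. It casts only the columns named in the expression and skips
lines starting with #.

```python
import re


def filter_rows(lines, expression, column_types):
    """Yield the lines for which `expression` is true.

    Hypothetical helper, not the real filters.py: only the columns that
    the expression names (c1, c2, ...) are cast, so an empty cell in an
    unused column cannot invalidate the row.
    """
    casts = {"int": int, "float": float, "str": str}
    # Column numbers referenced by the expression, e.g. [6] for c6=='Y'
    used = sorted({int(n) for n in re.findall(r"\bc(\d+)\b", expression)})
    code = compile(expression, "<filter>", "eval")
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # header/comment line: skip rather than report as invalid
        fields = line.split("\t")
        try:
            names = {
                "c%d" % n: casts[column_types[n - 1]](fields[n - 1])
                for n in used
            }
        except (ValueError, IndexError):
            continue  # a column the expression needs is missing or not castable
        # eval is used only to keep the sketch short (assumption)
        if eval(code, {"__builtins__": {}}, names):
            yield line
```

With this approach a filter on c6 never touches columns 4 and 5, so their
empty cells no longer matter. A filter that does use c4 would still drop the
rows where c4 is empty, which seems like reasonable behaviour.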

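For (b), one possible direction is to guess each column's type from every
row and treat empty cells as missing rather than as evidence. The sketch
below is only an assumption about how that could be done, not how Galaxy
detects types; guess_column_type is a hypothetical name.

```python
def guess_column_type(values):
    """Return 'int', 'float' or 'str' for one column's cell values.

    Hypothetical helper: empty cells are treated as missing and ignored,
    and every remaining value must parse for a numeric type to be chosen.
    """
    present = [v for v in values if v.strip()]
    if not present:
        return "str"  # nothing to go on, so fall back to the safest type
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in present:
                cast(v)
        except ValueError:
            continue  # at least one value does not fit, try the next type
        return name
    return "str"
```

This only addresses detection. An empty cell in an integer column still
cannot be passed to int(), so any tool reading such a column would also need
to handle missing values itself.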