Yes, I think the general plan of the DataFrames maintainers is to largely rely
on packages like Query and StructuredQueries for data manipulation.
There is another benefit of having this kind of query infrastructure in its own
package: all the query operations that I showed that use Query also work
against any other data source, i.e. you can use those with arrays,
dictionaries, directly with a CSV source (and everything will be streamed, no
allocations of intermediates!), TypedTables, IndexedTables and many more.
From: karbar...@gmail.com On Behalf Of Jacob Quinn
Sent: Wednesday, October 12, 2016 11:14 PM
Subject: Re: [julia-users] Filtering DataFrame with a function
I think the Julia ecosystem is evolving tremendously in this respect. I think
originally, there were a lot of these "mammoth" packages that tried to provide
everything and the kitchen sink. Unfortunately, this has led to package bloat,
package inefficiencies in terms of load times and installation, and
unmaintainability. DataFrames and Gadfly are great examples.
The trend more recently has been a rededication to small, modular packages that
interoperate nicely with others. This means moving things **out** of packages that
aren't totally essential; in the case of DataFrames, that can include things
like IO (CSV.jl), data manipulation (Query.jl and StructuredQueries.jl), and
Ultimately, with the help of core language features like
(https://github.com/JuliaLang/julia/issues/15705), I think we'll continue to
see packages slim down. This, of course, opens up more possibilities in the
future for so-called "meta" packages that could bundle several packages
together. These "meta" packages are then essentially tasked with tracking
versions, dependencies, and so forth while individual packages can focus on
simple, solid code.
On Wed, Oct 12, 2016 at 11:20 PM, Júlio Hoffimann <julio.hoffim...@gmail.com> wrote:
Thank you very much, David; these queries you showed are really nice. I meant
that ideally I wouldn't need to install another package for a simple filter
operation on the rows.
2016-10-12 22:14 GMT-07:00 <anth...@berkeley.edu>:
Were you worried about Query being not lightweight enough in terms of overhead,
or in terms of syntax?
I just added a more lightweight syntax for this scenario to Query. You can now
do the following two things:
q = @where(df, i->i.price > 30.)
This returns a filtered iterator. You can materialize that into a DataFrame
with collect(q, DataFrame).
I also added a counting option. Turns out that is actually a LINQ query
operator, and the goal is to implement all of those in Query. The syntax is
@count(df, i->i.price > 30.)
which returns the number of rows for which the filter condition is true.
Under the hood, both of these new syntax options use the normal Query machinery;
they just provide a simpler syntax relative to the more elaborate things I've
posted earlier. In terms of LINQ, this corresponds to the method invocation API
that LINQ has. I'm still figuring out how to surface something like @count in
the query expression syntax, but for now one can use it via this macro.
All of this is on master right now, so you would have to do
Pkg.checkout("Query") to get these macros.
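For comparison, the same two operations (a lazy filter that is materialized later, and a count of matching rows) can be sketched in plain Julia without Query. This is only an illustration, assuming Julia 1.x, where named tuples are built into Base; the `rows` vector and the `expensive` predicate are hypothetical stand-ins for a DataFrame and a filter condition, not part of Query or DataFrames.

```julia
# Hypothetical stand-in for a DataFrame: a vector of named tuples, one per row.
rows = [(name = "a", price = 10.0),
        (name = "b", price = 35.0),
        (name = "c", price = 50.0)]

# Hypothetical predicate applied to a single row.
expensive(row) = row.price > 30.0

# Lazy filter: no row is tested until the iterator is consumed.
q = Iterators.filter(expensive, rows)

# Materialize the matching rows into a Vector of named tuples.
filtered = collect(q)

# Count the matching rows directly, without building an intermediate array.
n = count(expensive, rows)
```

The lazy iterator plays the role of the result of `@where`, `collect` the role of materializing it, and `count` the role of `@count`.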
On Wednesday, October 12, 2016 at 6:47:15 PM UTC-7, Júlio Hoffimann wrote:
Thank you for your detailed answer and for writing a package for general
queries, that is great! I will keep the package in mind if I need something
I am currently looking for a lightweight solution within DataFrames, since
filtering is a very common operation. Right now, I am considering converting the
DataFrame to an array and looping over the rows. I wonder if there is
syntactic sugar for this loop.
2016-10-12 17:48 GMT-07:00 David Anthoff <ant...@berkeley.edu>:
You can use the Query package for the first part. To filter a DataFrame using
some arbitrary Julia expression, use something like this:
using DataFrames, Query, NamedTuples
q = @from i in df begin
    @where <filter expression>
    @select i
end
You can use any Julia code in <filter expression>. Say your DataFrame has a
column called price, then you could filter like this:
@where i.price > 30.
The i will be a NamedTuple type, so you can access the columns either by their
name, or also by their index, e.g.
@where i[1] > 30.
if you want to filter by the first column. You can also just call some function
that you have defined somewhere else:
As long as the <filter expression> returns a Bool, you should be good.
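To make the row access concrete, here is a small sketch in plain Julia, assuming Julia 1.x and its built-in named tuples (which may differ in detail from the NamedTuples package used in this thread). The `row` value and the `any_large` function are hypothetical; the latter shows a predicate defined elsewhere that does not depend on the number of columns.

```julia
# Hypothetical row with two columns.
row = (price = 42.0, volume = 7)

# Access a column by name or by position; both refer to the first column here.
by_name = row.price > 30.0
by_index = row[1] > 30.0

# A predicate defined elsewhere that works for any number of columns:
# true if any value in the row exceeds 30.
any_large(r) = any(v -> v > 30.0, values(r))

matches = any_large(row)
```

Any function of this shape returns a Bool, so it could serve as the filter expression.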
If you run a query like this, q will be a standard julia iterator. Right now
you can’t just say length(q), although that is something I should probably
enable at some point (I’m also looking into the VB LINQ syntax that supports
things like counting in the query expression itself).
But you could materialize the query as an array and then look at the length of
that:
q = @from i in df begin
    @where <filter expression>
    @select i
    @collect
end
count = length(q)
The @collect statement means that the query will return an array of a
NamedTuple type (you can also materialize it into a whole bunch of other data
structures, take a look at the documentation).
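The same distinction between a lazy iterator and a materialized array shows up in plain Julia. The sketch below assumes Julia 1.x and uses hypothetical sample data. A lazily filtered iterator does not know its length up front, so one either collects it first or counts while streaming.

```julia
# Hypothetical sample rows standing in for a DataFrame.
rows = [(price = 10.0,), (price = 35.0,), (price = 50.0,)]

# Lazy filter; Base reports its size as unknown, so length(q) is not defined.
q = Iterators.filter(r -> r.price > 30.0, rows)
size_trait = Base.IteratorSize(q)

# Option 1: materialize into an array, then take its length.
n_materialized = length(collect(q))

# Option 2: count elements while streaming, without allocating the array.
n_streamed = count(_ -> true, q)
```

Option 2 avoids the intermediate array, which matters mainly when the filtered result is large.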
Let me know if this works, or if you have any other feedback on Query.jl. I’m
much in need of user feedback for the package at this point. The best way to
give it is to open issues at https://github.com/davidanthoff/Query.jl.
From: julia...@googlegroups.com On Behalf Of Júlio Hoffimann
Sent: Wednesday, October 12, 2016 5:20 PM
To: julia-users <julia...@googlegroups.com>
Subject: [julia-users] Filtering DataFrame with a function
I have a DataFrame whose rows I want to filter by a given criterion.
I don't know the number of columns beforehand, so I cannot explicitly list the
criteria with the :symbol syntax or write down a fixed number of indices.
Is there any way to filter with a lambda expression? Or even better, is there
any efficient way to count the number of occurrences of a specific row of