RE: PIG Cassandra - Performance

Badrinarayanan S Fri, 17 Jun 2011 19:08:35 -0700

Hi Jeremy,

Thanks. Until 1.0 is released, we will also adopt a separate CF for analysis
purposes.

Regards,
badri

-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
Sent: Saturday, June 18, 2011 12:39 AM
To: user@pig.apache.org
Subject: Re: PIG Cassandra - Performance

The way cassandra currently does mapreduce is that it iterates over all the
rows of the column family.  So yes, performance would be related to the
growing number of rows.  You can use the Pig FILTER operator to filter them
down, but you are still iterating over all of the rows in that column
family.
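
To make the cost concrete, here is a rough sketch in plain Python (not Pig,
and not the actual Cassandra input format). The row layout, the "day" column
and the helper names are all hypothetical; the only point is that a filter
applied after a full scan shrinks the output, not the number of rows read.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical row shape: (row key, {column name: column value}).
Row = Tuple[str, Dict[str, str]]


def scan_with_filter(column_family: Iterable[Row], day: str) -> Tuple[List[Row], int]:
    """Read every row, keep only those whose 'day' column matches `day`.

    Returns the kept rows and the number of rows that had to be read.
    """
    rows_read = 0
    kept: List[Row] = []
    for key, columns in column_family:
        rows_read += 1  # every row is read, whether or not it matches
        if columns.get("day") == day:
            kept.append((key, columns))
    return kept, rows_read


def make_rows(days: int, rows_per_day: int) -> List[Row]:
    """Build a toy column family holding `days` days of data."""
    return [
        (f"d{d}-r{r}", {"day": f"day{d}"})
        for d in range(days)
        for r in range(rows_per_day)
    ]


if __name__ == "__main__":
    for days in (1, 10, 100):
        kept, rows_read = scan_with_filter(make_rows(days, 1000), day="day0")
        print(f"days stored={days} rows kept={len(kept)} rows read={rows_read}")
```

In this toy model the number of rows kept stays flat while the number of rows
read grows with the total stored, which matches the slowdown described in the
original question.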

There is a ticket - CASSANDRA-1600
(https://issues.apache.org/jira/browse/CASSANDRA-1600) that addresses this
and allows for subsets of rows to be specified.  It will also enable
mapreducing over secondary indexes in a column family.  We had hoped 1600
would be resolved by now but there was a complication with a dependent
issue.  I have been told that it will definitely be in the next major
release of Cassandra - 1.0, due out in the beginning of October.  From what
I understand, these updates will then enable both pig and hive to more
easily push down selects of subsets of data.

Until then, what we've done is set up a separate column family with data
that we want to analyze that only has a subset of the data.  Then when 1.0
comes out, we'll shift over to use that.
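
A minimal sketch of that workaround, again in plain Python with an in-memory
stand-in for the two column families. How the subset is chosen and how the
second column family is populated are assumptions here (rows for a single day,
written at insert time); the class and method names are hypothetical.

```python
from typing import Dict, List, Tuple

Columns = Dict[str, str]


class ToyKeyspace:
    """In-memory stand-in for a keyspace with a main and an analysis column family."""

    def __init__(self, analysis_day: str) -> None:
        self.analysis_day = analysis_day
        self.main_cf: Dict[str, Columns] = {}
        self.analysis_cf: Dict[str, Columns] = {}

    def write(self, key: str, columns: Columns) -> None:
        """Write to the main column family; copy matching rows to the analysis one."""
        self.main_cf[key] = columns
        # Assumption: the subset we analyze is the rows for one particular day.
        if columns.get("day") == self.analysis_day:
            self.analysis_cf[key] = columns


def full_scan(column_family: Dict[str, Columns]) -> Tuple[List[str], int]:
    """Iterate over every row, as the job does today; return keys and rows read."""
    keys: List[str] = []
    rows_read = 0
    for key in column_family:
        rows_read += 1
        keys.append(key)
    return keys, rows_read


if __name__ == "__main__":
    ks = ToyKeyspace(analysis_day="day9")
    for d in range(10):
        for r in range(1000):
            ks.write(f"d{d}-r{r}", {"day": f"day{d}"})
    print("main CF rows read:", full_scan(ks.main_cf)[1])
    print("analysis CF rows read:", full_scan(ks.analysis_cf)[1])
```

The job still iterates over every row it is pointed at, but the analysis
column family only ever holds the subset, so the scan stays roughly constant
in size as the main column family grows.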

Jeremy

On Jun 17, 2011, at 1:29 PM, Badrinarayanan S wrote:

> Hi,
> 
> 
> 
> In our production Cassandra systems we are observing that the time taken by
> the same PIG script keeps increasing every day. The PIG script reads data
> for one day at a time from a Cassandra column family. The number of rows the
> PIG script is expected to return is almost the same every day; however, the
> number of rows we are storing in Cassandra increases every day. We haven't
> changed the default setting for multiquery; it is enabled by default.
> 
> 
> 
> Could this increase in PIG script execution time be related to the
> increasing number of rows in Cassandra every day? 
> 
> 
> 
> Related to this, I was trying to understand the behavior of the LOAD
> statement. Does the LOAD statement read all the data from Cassandra and then
> apply the required filter conditions? If so, the increase in execution time
> could be attributed to the extra time required to read the ever-increasing
> data in Cassandra.
> 
> 
> 
> We are also working on a suitable archival mechanism for our data so that
> the total number of rows stored is always kept at an optimum count. This
> should also help us maintain an almost constant PIG script execution time
> every day.
> 
> 
> 
> Please advise.
> 
> 
> 
> Thanks,
> 
> Badri
