Hi, Mohan,

If you are still interested, you can check out the latest source code
from the SVN repository with:

    svn checkout https://codeforge.lbl.gov/anonscm/fastbit

The updated ardea.cpp accepts an option of the form

    -p number-of-rows-per-partition

which should break up the input CSV file into manageable partitions
for you.
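For example, the 378M-row import discussed below could be rerun as
something like the following; the row count per partition is only
illustrative, chosen to land in the 4-10 partitions suggested later in
the thread:

    ../fastbit-ibis1.3.7/examples/ardea -d tmp \
        -m "cid:int, iid:int, date:int, type:short" \
        -t data.csv \
        -p 50000000    # illustrative: ~8 partitions for 378M rows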
Let us know how it went when you get a chance to try it.

Good luck.

John

On 10/24/13, 6:26 AM, Mohan Embar wrote:
> Hi John & Andrew,
>
> Thanks for your replies. I'm using CentOS 5.9 (as a VM), not Cygwin.
>
> For the query I mentioned:
>
>     ../fastbit-ibis1.3.7/examples/ibis -d tmp -v -q "where cid=246973 and iid=7"
>
> ...I thought it would print a result set. Is this not the case for
> the above query? Or would the memory issues prevent this from
> happening?
>
> On Wed, Oct 23, 2013 at 11:04 PM, John <[email protected]> wrote:
>
>> Hi, Mohan,
>>
>> Are you using Cygwin? The pthread library seems to have some
>> problems under Cygwin. As far as I know, the warning messages are
>> harmless in this case. If you are not using Cygwin, then please give
>> us a few more details.
>>
>> If you have 378M rows in one data partition, then it is likely that
>> you are spilling virtual memory to disk. You should consider
>> separating them into 4-10 different partitions. Currently ardea.cpp
>> is not able to separate a single CSV file into multiple data
>> partitions, so you will have to split your CSV file somehow before
>> calling ardea.
>>
>> -- John Wu
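Before the -p option described at the top of this thread, that manual
split could be done with standard coreutils. A sketch, assuming the CSV
has no header row; the chunk size, file names, and one-directory-per-
partition layout are all illustrative:

    # split data.csv into 50M-line chunks: part_aa, part_ab, ...
    split -l 50000000 data.csv part_

    # load each chunk into its own data partition (layout illustrative)
    for f in part_*; do
        ../fastbit-ibis1.3.7/examples/ardea -d tmp/$f \
            -m "cid:int, iid:int, date:int, type:short" -t "$f"
    done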
>> On Oct 23, 2013, at 2:42 PM, Mohan Embar <[email protected]> wrote:
>>
>>> Hi John,
>>>
>>> Thanks for your quick reply!
>>>
>>> I wasn't clear on how I would go about adding data. Wouldn't that
>>> require rebuilding the indexes each time, which would be an
>>> expensive operation?
>>>
>>> I have 378M rows, and I just imported them with:
>>>
>>>     ../fastbit-ibis1.3.7/examples/ardea -d tmp -m "cid:int, iid:int, date:int, type:short" -t data.csv
>>>
>>> Then I tried:
>>>
>>>     ../fastbit-ibis1.3.7/examples/ibis -d tmp -v -q "where cid=246973 and iid=7"
>>>
>>> ...and I get a boatload of messages like this:
>>>
>>>     Constructed a part named tmp
>>>     query[QkrXsK8cBV-----0]::setWhereClause -- where "cid=246973 and iid=7"
>>>     Warning -- part[tmp]::gainWriteAccess --
>>>     pthread_rwlock_trywrlock(0x859c9d8) for freeRIDs returned 16
>>>     (Device or resource busy)
>>>     (repeated millions of times)
>>>     ...
>>>
>>> ...before finally printing:
>>>
>>>     query[QkrXsK8cBV-----0]::evaluate -- time to compute the 35
>>>     hits: 25.2392 sec(CPU), 25.3412 sec(elapsed).
>>>     query[QkrXsK8cBV-----0]::evaluate -- user root FROM tmp WHERE
>>>     cid=246973 and iid=7 ==> 35 hits.
>>>     doQuery:: evaluate( FROM tmp WHERE cid=246973 and iid=7)
>>>     produced 35 hits, took 25.2392 CPU seconds, 25.3
>>>
>>> I wasn't sure how to make it print the actual results rather than
>>> the count, or whether that error message appeared because I had too
>>> many rows.
>>>
>>> Thanks in advance for any help with this.
>>>
>>> On Wed, Oct 23, 2013 at 2:37 PM, John <[email protected]> wrote:
>>>
>>>> Thanks for your interest in FastBit. Given the types of data and
>>>> the type of query, FastBit would be the perfect tool. Do you have
>>>> a sense of how many rows you would have? If you have more than 100
>>>> million, you will likely need to break them into multiple
>>>> partitions.
>>>>
>>>> -- John Wu
>>>>
>>>> On Oct 23, 2013, at 9:08 AM, Mohan Embar <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm working on a project where we need to query massive amounts
>>>>> of log data (stored in MySQL) and was wondering if you could help
>>>>> me evaluate the suitability of FastBit for this.
>>>>>
>>>>> The relevant columns are:
>>>>>
>>>>>     contact id (unsigned int)
>>>>>     item id (unsigned int)
>>>>>     date (unsigned int)
>>>>>     type (numeric value from 0-30)
>>>>>
>>>>> I want to be able to answer questions like "give me all contacts
>>>>> who have type X and type Y, but not type Z", etc.
>>>>>
>>>>> I think FastBit is well suited for this, but the issue is that
>>>>> new log entries are continuously being added, which would
>>>>> preclude FastBit from being able to grow these in realtime. Log
>>>>> entries aren't being removed, however.
>>>>>
>>>>> Would FastBit be appropriate for this approach? If not, how would
>>>>> you suggest that I reason about comparing the following
>>>>> alternatives:
>>>>>
>>>>> - Use a hybrid FastBit/MySQL approach where I submit a query to
>>>>>   the known log entries in FastBit, then the same query against
>>>>>   the remainder of the MySQL records which haven't yet been added
>>>>>   to FastBit (which would be comparatively small)
>>>>>
>>>>> - Use another approach (Precog)
>>>>>
>>>>> Thanks in advance!
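On Mohan's question about printing the matching rows rather than just
the hit count: giving the query string a select clause should make the
bundled ibis example print the selected column values instead of only
reporting hits. A sketch; the column list is illustrative, and the
select-without-from form assumes the data directory given via -d
supplies the table:

    ../fastbit-ibis1.3.7/examples/ibis -d tmp -v \
        -q "select cid, iid, date, type where cid=246973 and iid=7"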

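For the original "type X and type Y, but not type Z" question, one
possible decomposition (an approach beyond what this thread spells out)
is to run one query per type value and combine the contact-id lists
outside FastBit. This assumes each query's output can be reduced to one
cid per line; the type values are illustrative:

    ibis=../fastbit-ibis1.3.7/examples/ibis
    $ibis -d tmp -q "select cid where type=5"  | sort -u > cids_x
    $ibis -d tmp -q "select cid where type=9"  | sort -u > cids_y
    $ibis -d tmp -q "select cid where type=12" | sort -u > cids_z

    # contacts with both X and Y, minus those with Z
    comm -12 cids_x cids_y | comm -23 - cids_z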