Hi, Mohan,

If you are still interested, you can check out the latest source code
from the SVN repository with:

    svn checkout https://codeforge.lbl.gov/anonscm/fastbit

The updated ardea.cpp accepts an option of the form

    -p number-of-rows-per-partition

which should break up the input CSV file into manageable partitions
for you.
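For example, the 378M-row import discussed below could be rerun as
something like the following; the row count per partition is only
illustrative, chosen to land in the 4-10 partitions suggested later in
the thread:

    ../fastbit-ibis1.3.7/examples/ardea -d tmp \
        -m "cid:int, iid:int, date:int, type:short" \
        -t data.csv \
        -p 50000000    # illustrative: ~8 partitions for 378M rows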
Let us know how it went when you get a chance to try it.

Good luck.

John

On 10/24/13, 6:26 AM, Mohan Embar wrote:
> Hi John & Andrew,
>
> Thanks for your replies. I'm using CentOS 5.9 (as a VM), not Cygwin.
>
> For the query I mentioned:
>
>     ../fastbit-ibis1.3.7/examples/ibis -d tmp -v -q "where cid=246973 and iid=7"
>
> ...I thought it would print a result set. Is this not the case for
> the above query? Or would the memory issues prevent this from
> happening?
>
> On Wed, Oct 23, 2013 at 11:04 PM, John <[email protected]> wrote:
>
>> Hi, Mohan,
>>
>> Are you using Cygwin? The pthread library seems to have some
>> problems under Cygwin. As far as I know, the warning messages are
>> harmless in this case. If you are not using Cygwin, then please give
>> us a few more details.
>>
>> If you have 378M rows in one data partition, then it is likely that
>> you are spilling virtual memory to disk. You should consider
>> separating them into 4-10 different partitions. Currently ardea.cpp
>> is not able to separate a single CSV file into multiple data
>> partitions, so you will have to split your CSV file somehow before
>> calling ardea.
>>
>> -- John Wu
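Before the -p option described at the top of this thread, that manual
split could be done with standard coreutils. A sketch, assuming the CSV
has no header row; the chunk size, file names, and one-directory-per-
partition layout are all illustrative:

    # split data.csv into 50M-line chunks: part_aa, part_ab, ...
    split -l 50000000 data.csv part_

    # load each chunk into its own data partition (layout illustrative)
    for f in part_*; do
        ../fastbit-ibis1.3.7/examples/ardea -d tmp/$f \
            -m "cid:int, iid:int, date:int, type:short" -t "$f"
    done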
>> On Oct 23, 2013, at 2:42 PM, Mohan Embar <[email protected]> wrote:
>>
>>> Hi John,
>>>
>>> Thanks for your quick reply!
>>>
>>> I wasn't clear on how I would go about adding data. Wouldn't that
>>> require rebuilding the indexes each time, which would be an
>>> expensive operation?
>>>
>>> I have 378M rows, and I just imported them with:
>>>
>>>     ../fastbit-ibis1.3.7/examples/ardea -d tmp -m "cid:int, iid:int, date:int, type:short" -t data.csv
>>>
>>> Then I tried:
>>>
>>>     ../fastbit-ibis1.3.7/examples/ibis -d tmp -v -q "where cid=246973 and iid=7"
>>>
>>> ...and I get a boatload of messages like this:
>>>
>>>     Constructed a part named tmp
>>>     query[QkrXsK8cBV-----0]::setWhereClause -- where "cid=246973 and iid=7"
>>>     Warning -- part[tmp]::gainWriteAccess --
>>>     pthread_rwlock_trywrlock(0x859c9d8) for freeRIDs returned 16
>>>     (Device or resource busy)
>>>     (repeated millions of times)
>>>     ...
>>>
>>> ...before finally printing:
>>>
>>>     query[QkrXsK8cBV-----0]::evaluate -- time to compute the 35
>>>     hits: 25.2392 sec(CPU), 25.3412 sec(elapsed).
>>>     query[QkrXsK8cBV-----0]::evaluate -- user root FROM tmp WHERE
>>>     cid=246973 and iid=7 ==> 35 hits.
>>>     doQuery:: evaluate( FROM tmp WHERE cid=246973 and iid=7)
>>>     produced 35 hits, took 25.2392 CPU seconds, 25.3
>>>
>>> I wasn't sure how to make it print the actual results rather than
>>> the count, or whether that error message appeared because I had too
>>> many rows.
>>>
>>> Thanks in advance for any help with this.
>>>
>>> On Wed, Oct 23, 2013 at 2:37 PM, John <[email protected]> wrote:
>>>
>>>> Thanks for your interest in FastBit. Given the types of data and
>>>> the type of query, FastBit would be the perfect tool. Do you have
>>>> a sense of how many rows you would have? If you have more than 100
>>>> million, you will likely need to break them into multiple
>>>> partitions.
>>>>
>>>> -- John Wu
>>>>
>>>> On Oct 23, 2013, at 9:08 AM, Mohan Embar <[email protected]> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I'm working on a project where we need to query massive amounts
>>>>> of log data (stored in MySQL) and was wondering if you could help
>>>>> me evaluate the suitability of FastBit for this.
>>>>>
>>>>> The relevant columns are:
>>>>>
>>>>>     contact id (unsigned int)
>>>>>     item id (unsigned int)
>>>>>     date (unsigned int)
>>>>>     type (numeric value from 0-30)
>>>>>
>>>>> I want to be able to answer questions like "give me all contacts
>>>>> who have type X and type Y, but not type Z", etc.
>>>>>
>>>>> I think FastBit is well suited for this, but the issue is that
>>>>> new log entries are continuously being added, which would
>>>>> preclude FastBit from being able to grow these in realtime. Log
>>>>> entries aren't being removed, however.
>>>>>
>>>>> Would FastBit be appropriate for this approach? If not, how would
>>>>> you suggest that I reason about comparing the following
>>>>> alternatives:
>>>>>
>>>>> - Use a hybrid FastBit/MySQL approach where I submit a query to
>>>>>   the known log entries in FastBit, then the same query against
>>>>>   the remainder of the MySQL records which haven't yet been added
>>>>>   to FastBit (which would be comparatively small)
>>>>>
>>>>> - Use another approach (Precog)
>>>>>
>>>>> Thanks in advance!
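On Mohan's question about printing the matching rows rather than just
the hit count: giving the query string a select clause should make the
bundled ibis example print the selected column values instead of only
reporting hits. A sketch; the column list is illustrative, and the
select-without-from form assumes the data directory given via -d
supplies the table:

    ../fastbit-ibis1.3.7/examples/ibis -d tmp -v \
        -q "select cid, iid, date, type where cid=246973 and iid=7"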

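For the original "type X and type Y, but not type Z" question, one
possible decomposition (an approach beyond what this thread spells out)
is to run one query per type value and combine the contact-id lists
outside FastBit. This assumes each query's output can be reduced to one
cid per line; the type values are illustrative:

    ibis=../fastbit-ibis1.3.7/examples/ibis
    $ibis -d tmp -q "select cid where type=5"  | sort -u > cids_x
    $ibis -d tmp -q "select cid where type=9"  | sort -u > cids_y
    $ibis -d tmp -q "select cid where type=12" | sort -u > cids_z

    # contacts with both X and Y, minus those with Z
    comm -12 cids_x cids_y | comm -23 - cids_z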