Hi, Greg,

The mystery might be related to the lazy updating of the min/max
values.  Even when the min and max values in the metadata file are
wrong, FastBit should be able to answer the queries correctly.  Our
first application wanted us to use the min and max as nominal lower
and upper bounds, so the actual min and max could vary significantly
from the nominal bounds.  To force the computation of the min and
max values, please call ibis::part::computeMinMax.
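
As a rough illustration of why stale bounds need not affect query
answers, here is a toy model in plain C++.  It is not FastBit code:
ToyColumn, nominalMin, nominalMax and countInRange are made-up names,
and only the name computeMinMax is borrowed from the library.  It
assumes queries are answered from the data while the recorded bounds
are refreshed only on request.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-in for one column of a data partition (hypothetical, not FastBit).
struct ToyColumn {
    std::vector<long> values;  // the actual data
    long nominalMin;           // lower bound as recorded in the metadata
    long nominalMax;           // upper bound as recorded in the metadata

    // Count the rows with lo <= value <= hi by scanning the data.
    // The recorded bounds are deliberately not consulted here.
    long countInRange(long lo, long hi) const {
        return static_cast<long>(
            std::count_if(values.begin(), values.end(),
                          [lo, hi](long v) { return lo <= v && v <= hi; }));
    }

    // Refresh the recorded bounds from the data.
    void computeMinMax() {
        assert(!values.empty());
        auto mm = std::minmax_element(values.begin(), values.end());
        nominalMin = *mm.first;
        nominalMax = *mm.second;
    }
};
```

With the keys from your test (6, 7 followed by 1 through 5) and
recorded bounds of 6 and 7, countInRange(1, 5) still finds 5 rows, and
the recorded bounds become 1 and 7 only after computeMinMax() runs.
In your program the corresponding call would presumably be
existing_part.computeMinMax().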

When you initialize an ibis::part with a single string argument, it is
assumed to be a directory name if it contains directory separators or
it names an existing directory.  If the string neither contains a '/'
nor names an existing directory, then it is necessary to pass the
second string argument (which could be a null pointer) to tell FastBit
to use the first argument as the directory name.
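
The rule in the previous paragraph can be sketched as a small
predicate.  This only mirrors the rule as worded here, for POSIX
systems; treatedAsDirectory is a hypothetical helper and not a
function in FastBit.

```cpp
#include <string>
#include <sys/stat.h>

// Hypothetical helper: returns true when `name` would be read as a
// directory name under the rule described above, i.e. it contains a
// '/' or it names an existing directory.
bool treatedAsDirectory(const std::string& name) {
    if (name.find('/') != std::string::npos) {
        return true;  // contains a directory separator
    }
    struct stat st;
    return stat(name.c_str(), &st) == 0 && S_ISDIR(st.st_mode);
}
```

When this predicate would be false for your string, the two-argument
form with a null second argument is the one to use, as in the
existing_part line of your test program.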

Let me know if you have any additional questions.

John




On 8/14/12 12:58 PM, Greg Barker wrote:
> Hi John,
> 
> I've been running into a scenario where I'm not able to deactivate
> rows that exist in the data file. I noticed when it gets into this
> state, the min & max for my_primary_key in -part.txt seems to be
> incorrect. I'm having trouble coming up with a small program that can
> reproduce the issue, but this seems to get pretty close. Before I ran
> it, the three directories it uses existed and were empty.
> 
> $ cat 7rows.csv
> 1,93.19,AAA
> 2,49.14,BBB
> 3,49.19,DDD
> 4,59.10,EEE
> 5,34.48,FFF
> 6,91.49,AAA
> 7,19.50,BBB
> 
> $ cat 5rows.csv
> 1,93.19,AAA
> 2,49.14,BBB
> 3,50.41,CCC
> 4,58.59,AAA
> 5,19.53,CCC
> 
> $ cat loading_error.cc
> #include <iostream>
> #include <memory>
> 
> #include <ibis.h>
> 
> int main(int argc, char **argv)
> {
>     ibis::gVerbose = 1;
> 
>     char existing_dir[] = "existing_dir";
>     char first_incoming_dir[] = "first_incoming_dir";
>     char second_incoming_dir[] = "second_incoming_dir";
> 
>     std::auto_ptr<ibis::tablex> firstTable(ibis::tablex::create());
>     firstTable->addColumn("my_primary_key", ibis::LONG);
>     firstTable->addColumn("my_double_value", ibis::DOUBLE);
>     firstTable->addColumn("my_category_value", ibis::CATEGORY);
>     firstTable->readCSV("7rows.csv", 0, first_incoming_dir, ",");
>     firstTable->write(first_incoming_dir, "working", NULL, NULL, NULL);
>     firstTable->clearData();
> 
>     ibis::part existing_part(existing_dir, static_cast<const char*>(0));
>     existing_part.append(first_incoming_dir);
>     existing_part.commit(first_incoming_dir);
>     existing_part.purgeIndexFiles();
>     existing_part.buildIndexes();
>     existing_part.emptyCache();
> 
>     std::auto_ptr<ibis::tablex> secondTable(ibis::tablex::create());
>     secondTable->addColumn("my_primary_key", ibis::LONG);
>     secondTable->addColumn("my_double_value", ibis::DOUBLE);
>     secondTable->addColumn("my_category_value", ibis::CATEGORY);
>     secondTable->readCSV("5rows.csv", 0, second_incoming_dir, ",");
>     secondTable->write(second_incoming_dir, "working", NULL, NULL, NULL);
>     secondTable->clearData();
> 
>     ibis::part second_part(second_incoming_dir);
> 
>     int deactivatedCount = 0;
>     deactivatedCount = existing_part.deactivate("my_primary_key in (1, 2, 3, 4, 5)");
>     std::cout << "deactivatedCount = " << deactivatedCount << std::endl;
>     existing_part.purgeInactive();
> 
>     existing_part.append(second_incoming_dir);
>     existing_part.commit(second_incoming_dir);
>     existing_part.purgeIndexFiles();
>     existing_part.buildIndexes();
>     existing_part.emptyCache();
> }
> 
> I end up with this in the -part.txt in existing_dir:
> 
> Begin Column
> name = "my_primary_key"
> data_type = "LONG"
> minimum = 6
> maximum = 7
> End Column
> 
> I was thinking it should have min = 1 & max = 7.
> 
> Thank you,
> Greg
> 
> On Mon, Aug 13, 2012 at 9:13 PM, Greg Barker <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Whoops my mistake, deactivate() returns the number of inactive
>     rows, just like it says in the doc :)
> 
>     Greg
> 
> 
>     On Mon, Aug 13, 2012 at 6:11 PM, Greg Barker
>     <[email protected] <mailto:[email protected]>> wrote:
> 
>         Hello John,
> 
>         Thank you for the updated code, it appears to be working quite
>         well now for that case. I really appreciate it.
> 
>         Another thing I noticed while I was testing is that if you
>         call deactivate() multiple times before purgeInactive(), the
>         return value was not what I expected. Do I need to call
>         purgeInactive() after each deactivate()?
> 
>         For example:
> 
>         int deactivatedCount = 0;
>         deactivatedCount += existing_part.deactivate("my_primary_key in (1, 2)");
>         deactivatedCount += existing_part.deactivate("my_primary_key in (3, 4)");
>         existing_part.purgeInactive();
>         std::cout << "deactivatedCount = " << deactivatedCount << "\n";
> 
>         Which yields:
> 
>         part[existing_dir]::deactivate marked 2 rows as inactive,
>         leaving 3 active rows out of 5
>         part[existing_dir]::deactivate marked 2 rows as inactive,
>         leaving 1 active row out of 5
>         part[existing_dir]::purgeInactive to remove 4 out of 5 rows
>         deactivatedCount = 6
> 
>         Thanks again for your work,
> 
>         Greg
> 
> 
>         On Mon, Aug 13, 2012 at 4:10 PM, K. John Wu <[email protected]
>         <mailto:[email protected]>> wrote:
> 
>             Hi, Greg,
> 
>             Thanks for the test case and test code.  The problem
>             should be fixed with SVN Revision 538.  Please give it a
>             try when you get the chance.
> 
>             One minor change to your test program is needed for it
>             to do what you want.  The following line,
> 
>                  ibis::part existing_part(existing_dir);
> 
>             needs to be changed to
> 
>                  ibis::part existing_part(existing_dir, static_cast<const char*>(0));
> 
>             The version you used will create two directories hidden in
>             .ibis,
>             which are probably not what you want.
> 
>             John
> 
> 
> 
>             On 8/13/12 1:57 AM, Greg Barker wrote:
>             > Hello,
>             >
>             > The type of my_primary_key is a long. I was able to reproduce the
>             > error without the join. I also noticed that it does not hit the seg
>             > fault if the category column is omitted. The following program will
>             > hit the error.
>             >
>             > $ cat first_data_file.csv
>             > 1,93.19,AAA
>             > 2,49.14,BBB
>             > 3,50.41,CCC
>             > 4,58.59,AAA
>             > 5,19.53,CCC
>             >
>             > $ cat second_data_file.csv
>             > 3,49.19,DDD
>             > 4,59.10,EEE
>             > 5,34.48,FFF
>             > 6,91.49,AAA
>             > 7,19.50,BBB
>             >
>             > $ cat loading_error.cc
>             > #include <memory>
>             >
>             > #include <ibis.h>
>             >
>             > int main(int argc, char **argv)
>             > {
>             >     char existing_dir[] = "existing_dir";
>             >     char first_incoming_dir[] = "first_incoming_dir";
>             >     char second_incoming_dir[] = "second_incoming_dir";
>             >
>             >     std::auto_ptr<ibis::tablex> firstTable(ibis::tablex::create());
>             >     firstTable->addColumn("my_primary_key", ibis::LONG);
>             >     firstTable->addColumn("my_double_value", ibis::DOUBLE);
>             >     firstTable->addColumn("my_category_value", ibis::CATEGORY);
>             >     firstTable->readCSV("first_data_file.csv", 0, first_incoming_dir, ",");
>             >     firstTable->write(first_incoming_dir, "working", NULL, NULL, NULL);
>             >     firstTable->clearData();
>             >
>             >     ibis::part existing_part(existing_dir);
>             >     existing_part.append(first_incoming_dir);
>             >     existing_part.commit(first_incoming_dir);
>             >     existing_part.purgeIndexFiles();
>             >     existing_part.buildIndexes();
>             >     existing_part.emptyCache();
>             >
>             >     std::auto_ptr<ibis::tablex> secondTable(ibis::tablex::create());
>             >     secondTable->addColumn("my_primary_key", ibis::LONG);
>             >     secondTable->addColumn("my_double_value", ibis::DOUBLE);
>             >     secondTable->addColumn("my_category_value", ibis::CATEGORY);
>             >     secondTable->readCSV("second_data_file.csv", 0, second_incoming_dir, ",");
>             >     secondTable->write(second_incoming_dir, "working", NULL, NULL, NULL);
>             >     secondTable->clearData();
>             >
>             >     ibis::part second_part(second_incoming_dir);
>             >
>             >     existing_part.deactivate("my_primary_key = 1");
>             >     existing_part.purgeInactive();
>             >
>             >     existing_part.append(second_incoming_dir);
>             > }
>             >
>             > Thank you John,
>             >
>             > Greg
>             >
>             > On Sun, Aug 12, 2012 at 3:27 PM, K. John Wu <[email protected]
>             <mailto:[email protected]>
>             > <mailto:[email protected] <mailto:[email protected]>>> wrote:
>             >
>             >     Hi, Greg,
>             >
>             >     Thanks for the information.  Looks like we might have neglected to
>             >     close some index files or somehow mishandled some index files.  There
>             >     is only one easy thing for us to check, which is related to the
>             >     handling of categorical values (the columns of type ibis::CATEGORY).
>             >     Would you mind telling us if my_primary_key is an integer column or a
>             >     CATEGORY column?
>             >
>             >     If it is not a CATEGORY, then we might have something a little bit
>             >     more complex.  We would appreciate a small test case to replicate the
>             >     problem.
>             >
>             >     John
>             >
>             >
>             >     On 8/10/12 5:32 PM, Greg Barker wrote:
>             >     > Hello -
>             >     >
>             >     > I am attempting to append some new data to some existing data, and ran
>             >     > into some trouble. When loading, I join the new data to the existing
>             >     > data on a particular column, and then deactivate & purgeInactive on
>             >     > the matching records. Then when I try to append the new data to the
>             >     > existing data, I hit a seg fault using rev 536. If I
>             >     > call purgeIndexFiles before the append, it seems to avoid the crash,
>             >     > but I wasn't sure if that was recommended?
>             >     >
>             >     > My code is essentially:
>             >     >
>             >     >     ibis::part existing_part("my_data");
>             >     >     ibis::part incoming_part("new_data");
>             >     >     std::auto_ptr<ibis::quaere>
>             >     >     join(ibis::quaere::create(&existing_part, &incoming_part,
>             >     >     "my_primary_key"));
>             >     >     std::auto_ptr<ibis::table> rs(join->select("my_primary_key"));
>             >     >     //then build the where clause
>             >     >     working_part.deactivate("my_primary_key in (3, 4, 5)");
>             >     >     working_part.purgeInactive();
>             >     >     working_part.append(incoming_data);
>             >     >
>             >     >
>             >     > Which yields the following:
>             >     >
>             >     >     part[my_data]::deactivate marked 9 rows as inactive, leaving 10
>             >     >     active rows out of 19
>             >     >     part[my_data]::purgeInactive to remove 9 out of 19 rows
>             >     >     Warning -- fileManager::flushDir can not remove in-memory file
>             >     >     (my_data/my_primary_key.idx).  It is in use
>             >     >     Warning -- fileManager::flushDir(my_data) finished with 1 file
>             >     >     still in memory
>             >     >     Constructed a part named my_data
>             >     >     filter::sift1S -- processing data partition my_data
>             >     >     Segmentation fault (core dumped)
>             >     >
>             >     > Many Thanks,
>             >     > Greg
>             >
>             >
> 
> 
> 
> 
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
