Hi John,

I've been running into a scenario where I'm not able to deactivate rows
that exist in the data file. I noticed that when it gets into this state,
the min & max for my_primary_key in -part.txt seem to be incorrect. I'm
having trouble coming up with a small program that reproduces the issue,
but the one below seems to get pretty close. Before I ran it, the three
directories it uses existed and were empty.

$ cat 7rows.csv
1,93.19,AAA
2,49.14,BBB
3,49.19,DDD
4,59.10,EEE
5,34.48,FFF
6,91.49,AAA
7,19.50,BBB

$ cat 5rows.csv
1,93.19,AAA
2,49.14,BBB
3,50.41,CCC
4,58.59,AAA
5,19.53,CCC

$ cat loading_error.cc
#include <memory>

#include <ibis.h>

int main(int argc, char **argv)
{
    ibis::gVerbose = 1;

    char existing_dir[] = "existing_dir";
    char first_incoming_dir[] = "first_incoming_dir";
    char second_incoming_dir[] = "second_incoming_dir";

    std::auto_ptr<ibis::tablex> firstTable(ibis::tablex::create());
    firstTable->addColumn("my_primary_key", ibis::LONG);
    firstTable->addColumn("my_double_value", ibis::DOUBLE);
    firstTable->addColumn("my_category_value", ibis::CATEGORY);
    firstTable->readCSV("7rows.csv", 0, first_incoming_dir, ",");
    firstTable->write(first_incoming_dir, "working", NULL, NULL, NULL);
    firstTable->clearData();

    ibis::part existing_part(existing_dir, static_cast<const char*>(0));
    existing_part.append(first_incoming_dir);
    existing_part.commit(first_incoming_dir);
    existing_part.purgeIndexFiles();
    existing_part.buildIndexes();
    existing_part.emptyCache();

    std::auto_ptr<ibis::tablex> secondTable(ibis::tablex::create());
    secondTable->addColumn("my_primary_key", ibis::LONG);
    secondTable->addColumn("my_double_value", ibis::DOUBLE);
    secondTable->addColumn("my_category_value", ibis::CATEGORY);
    secondTable->readCSV("5rows.csv", 0, second_incoming_dir, ",");
    secondTable->write(second_incoming_dir, "working", NULL, NULL, NULL);
    secondTable->clearData();

    ibis::part second_part(second_incoming_dir);

    int deactivatedCount = 0;
    deactivatedCount = existing_part.deactivate("my_primary_key in (1, 2, 3, 4, 5)");
    std::cout << "deactivatedCount = " << deactivatedCount << std::endl;
    existing_part.purgeInactive();

    existing_part.append(second_incoming_dir);
    existing_part.commit(second_incoming_dir);
    existing_part.purgeIndexFiles();
    existing_part.buildIndexes();
    existing_part.emptyCache();
}

I end up with this in the -part.txt in existing_dir:

Begin Column
name = "my_primary_key"
data_type = "LONG"
minimum = 6
maximum = 7
End Column

I was expecting min = 1 & max = 7, since the second append re-adds keys 1
through 5.
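
For reference, here is a minimal sketch of how the recorded range could be
double-checked after the program above finishes. It assumes ibis::column
exposes lowerBound()/upperBound() for the bounds cached in -part.txt and
getActualMin()/getActualMax() for values recomputed from the data; if those
accessors differ in your revision, treat this as pseudocode.

#include <iostream>

#include <ibis.h>

int main(int argc, char **argv)
{
    // Re-open the partition without creating backup directories.
    ibis::part existing_part("existing_dir", static_cast<const char*>(0));
    ibis::column *col = existing_part.getColumn("my_primary_key");
    if (col != 0) {
        // Compare the bounds cached in -part.txt against recomputed ones.
        std::cout << "cached min/max: " << col->lowerBound()
                  << " / " << col->upperBound() << std::endl;
        std::cout << "actual min/max: " << col->getActualMin()
                  << " / " << col->getActualMax() << std::endl;
    }
    return 0;
}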

Thank you,
Greg

On Mon, Aug 13, 2012 at 9:13 PM, Greg Barker <[email protected]> wrote:

> Whoops, my mistake: deactivate() returns the total number of inactive
> rows, just like it says in the doc :)
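>
> For anyone else hitting this: since deactivate() returns the running
> total of inactive rows, a minimal sketch for counting only the newly
> deactivated rows per call (names as in my snippet below) would be:
>
> long totalInactive = existing_part.deactivate("my_primary_key in (1, 2)");
> long newlyDeactivated = totalInactive;             // 2 rows
> long previous = totalInactive;
> totalInactive = existing_part.deactivate("my_primary_key in (3, 4)");
> newlyDeactivated += totalInactive - previous;      // 2 more, 4 in total
> existing_part.purgeInactive();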
>
> Greg
>
>
> On Mon, Aug 13, 2012 at 6:11 PM, Greg Barker <[email protected]> wrote:
>
>> Hello John,
>>
>> Thank you for the updated code; it appears to be working quite well
>> now for that case. I really appreciate it.
>>
>> Another thing I noticed while testing is that if you call deactivate()
>> multiple times before purgeInactive(), the return value is not what I
>> expected. Do I need to call purgeInactive() after each deactivate()?
>>
>> For example:
>>
>> int deactivatedCount = 0;
>> deactivatedCount += existing_part.deactivate("my_primary_key in (1, 2)");
>> deactivatedCount += existing_part.deactivate("my_primary_key in (3, 4)");
>> existing_part.purgeInactive();
>> std::cout << "deactivatedCount = " << deactivatedCount << "\n";
>>
>> Which yields:
>>
>> part[existing_dir]::deactivate marked 2 rows as inactive, leaving 3
>> active rows out of 5
>> part[existing_dir]::deactivate marked 2 rows as inactive, leaving 1
>> active row out of 5
>> part[existing_dir]::purgeInactive to remove 4 out of 5 rows
>> deactivatedCount = 6
>>
>> Thanks again for your work,
>>
>> Greg
>>
>>
>> On Mon, Aug 13, 2012 at 4:10 PM, K. John Wu <[email protected]> wrote:
>>
>>> Hi, Greg,
>>>
>>> Thanks for the test case and test code.  The problem should be fixed
>>> with SVN Revision 538.  Please give it a try when you get the chance.
>>>
>>> There is one minor change needed to your test program in order for it
>>> to do what you want.  The following line,
>>>
>>>      ibis::part existing_part(existing_dir);
>>>
>>> needs to be changed to
>>>
>>>      ibis::part existing_part(existing_dir, static_cast<const char*>(0));
>>>
>>> The version you used will create two hidden directories under .ibis,
>>> which is probably not what you want.
>>>
>>> John
>>>
>>>
>>>
>>> On 8/13/12 1:57 AM, Greg Barker wrote:
>>> > Hello,
>>> >
>>> > The type of my_primary_key is a long. I was able to reproduce the
>>> > error without the join. I also noticed that the program does not
>>> > hit the seg fault if the category column is omitted. The following
>>> > program hits the error.
>>> >
>>> > $ cat first_data_file.csv
>>> > 1,93.19,AAA
>>> > 2,49.14,BBB
>>> > 3,50.41,CCC
>>> > 4,58.59,AAA
>>> > 5,19.53,CCC
>>> >
>>> > $ cat second_data_file.csv
>>> > 3,49.19,DDD
>>> > 4,59.10,EEE
>>> > 5,34.48,FFF
>>> > 6,91.49,AAA
>>> > 7,19.50,BBB
>>> >
>>> > $ cat loading_error.cc
>>> > #include <memory>
>>> >
>>> > #include <ibis.h>
>>> >
>>> > int main(int argc, char **argv)
>>> > {
>>> >     char existing_dir[] = "existing_dir";
>>> >     char first_incoming_dir[] = "first_incoming_dir";
>>> >     char second_incoming_dir[] = "second_incoming_dir";
>>> >
>>> >     std::auto_ptr<ibis::tablex> firstTable(ibis::tablex::create());
>>> >     firstTable->addColumn("my_primary_key", ibis::LONG);
>>> >     firstTable->addColumn("my_double_value", ibis::DOUBLE);
>>> >     firstTable->addColumn("my_category_value", ibis::CATEGORY);
>>> >     firstTable->readCSV("first_data_file.csv", 0, first_incoming_dir, ",");
>>> >     firstTable->write(first_incoming_dir, "working", NULL, NULL, NULL);
>>> >     firstTable->clearData();
>>> >
>>> >     ibis::part existing_part(existing_dir);
>>> >     existing_part.append(first_incoming_dir);
>>> >     existing_part.commit(first_incoming_dir);
>>> >     existing_part.purgeIndexFiles();
>>> >     existing_part.buildIndexes();
>>> >     existing_part.emptyCache();
>>> >
>>> >     std::auto_ptr<ibis::tablex> secondTable(ibis::tablex::create());
>>> >     secondTable->addColumn("my_primary_key", ibis::LONG);
>>> >     secondTable->addColumn("my_double_value", ibis::DOUBLE);
>>> >     secondTable->addColumn("my_category_value", ibis::CATEGORY);
>>> >     secondTable->readCSV("second_data_file.csv", 0, second_incoming_dir, ",");
>>> >     secondTable->write(second_incoming_dir, "working", NULL, NULL, NULL);
>>> >     secondTable->clearData();
>>> >
>>> >     ibis::part second_part(second_incoming_dir);
>>> >
>>> >     existing_part.deactivate("my_primary_key = 1");
>>> >     existing_part.purgeInactive();
>>> >
>>> >     existing_part.append(second_incoming_dir);
>>> > }
>>> >
>>> > Thank you, John,
>>> >
>>> > Greg
>>> >
>>> > On Sun, Aug 12, 2012 at 3:27 PM, K. John Wu <[email protected]> wrote:
>>> >
>>> >     Hi, Greg,
>>> >
>>> >     Thanks for the information.  It looks like we might have
>>> >     neglected to close some index files or somehow mishandled some
>>> >     index files.  There is only one easy thing for us to check; it
>>> >     is related to the handling of categorical values (the columns of
>>> >     type ibis::CATEGORY).  Would you mind telling us whether
>>> >     my_primary_key is an integer column or a CATEGORY column?
>>> >
>>> >     If it is not a CATEGORY, then we might have something a little
>>> >     bit more complex.  We would appreciate a small test case to
>>> >     replicate the problem.
>>> >
>>> >     John
>>> >
>>> >
>>> >     On 8/10/12 5:32 PM, Greg Barker wrote:
>>> >     > Hello -
>>> >     >
>>> >     > I am attempting to append some new data to some existing
>>> >     > data, and ran into some trouble. When loading, I join the new
>>> >     > data to the existing data on a particular column, and then
>>> >     > deactivate & purgeInactive on the matching records. Then when
>>> >     > I try to append the new data to the existing data, I hit a seg
>>> >     > fault using rev 536. If I call purgeIndexFiles before the
>>> >     > append, it seems to avoid the crash, but is that recommended?
>>> >     >
>>> >     > My code is essentially:
>>> >     >
>>> >     >     ibis::part existing_part("my_data");
>>> >     >     ibis::part incoming_part("new_data");
>>> >     >     std::auto_ptr<ibis::quaere>
>>> >     >         join(ibis::quaere::create(&existing_part, &incoming_part,
>>> >     >              "my_primary_key"));
>>> >     >     std::auto_ptr<ibis::table> rs(join->select("my_primary_key"));
>>> >     >     // then build the where clause from the join result
>>> >     >     existing_part.deactivate("my_primary_key in (3, 4, 5)");
>>> >     >     existing_part.purgeInactive();
>>> >     >     existing_part.append("new_data");
>>> >     >
>>> >     >
>>> >     > Which yields the following:
>>> >     >
>>> >     >     part[my_data]::deactivate marked 9 rows as inactive,
>>> >     >     leaving 10 active rows out of 19
>>> >     >     part[my_data]::purgeInactive to remove 9 out of 19 rows
>>> >     >     Warning -- fileManager::flushDir can not remove in-memory
>>> >     >     file (my_data/my_primary_key.idx).  It is in use
>>> >     >     Warning -- fileManager::flushDir(my_data) finished with 1
>>> >     >     file still in memory
>>> >     >     Constructed a part named my_data
>>> >     >     filter::sift1S -- processing data partition my_data
>>> >     >     Segmentation fault (core dumped)
>>> >     >
>>> >     > Many Thanks,
>>> >     > Greg
>>> >
>>> >
>>>
>>
>>
>