Re: [FastBit-users] huge .sp files

K. John Wu Thu, 09 Jul 2009 11:20:58 -0700

Hi, Andrew,

Instead of having another "table", I would encourage to try 
ibis::dictionary to hold the "lookup table".  In the C++ program that 
performs SQL query, you will be basically using one ibis::table called 
data and execute a call to function select with select clause "seqid, 
start, end" and where clause "score > 1000".  You will get back an 
in-memory table with three columns: seqid, start, and end.  When print 
out the tuples out, instead of printing the value of seqid, use the 
dictionary to lookup the string representation.


The tricky part is that ibis::dictionary maps normal strings from 1 up 
with 0 reserved for null strings (and treated separately).  If you 
have mapped strings from 1 up, you can write the dictionary file as 
follows.  The string assigned integer 1 appears first in the 
dictionary file with the null terminator, the string assigned value 2 
appears next, and so on.  To write the null terminator, open the 
output in binary mode, for a string with nchar = strlen(...) bytes, 
write nchar+1 bytes (instead of nchar).

John

PS: As I am writing this message, I am starting to question the wisdom 
of exporting file offsets to the end user -- the rationale for that 
was to make it possible for end users to access strings.  It does not 
seem to be a particularly bright idea to start with.  This might be a 
good opportunity to get rid of it...

On 7/9/2009 7:08 AM, Andrew Olson wrote:
> Hi John,
> 
> In that case, can you help me understand how to do natural joins?  I  
> created a lookup table to replace seqname with seqid.  This was  
> necessary to get groupby(seqid) to work.  So an example SQL query that  
> I would like to run could be something like this:
> 
> select lut.seqname, data.start, data.end from lut, data where  
> lut.seqid = data.seqid and data.score > 1000;
> 
> Andrew
> 
> On Jul 8, 2009, at 9:00 PM, K. John Wu wrote:
> 
>> Hi, Andrew,
>>
>> I was looking at the possibility of removing .sp files for categorical
>> values.  It turned out that I have promised some user certain
>> functionality that requires this file a couple of months ago.
>> Therefore, it would not be possible to remove .sp files without also
>> taking back that functionality.  The easier thing would be for us to
>> keep .sp files around.  Sorry for jumping the gun on this one..
>>
>> In light of this, your best bet might be to convert your categorical
>> values into integers yourself and ask FastBit to manage the integers.
>>
>> John
>>
>> PS: The particular function is called ibis::column::selectLongs, it is
>> specialized for ibis::text and ibis::category to return the starting
>> positions of the strings selected.  The .sp files contains these
>> starting positions needed for these versions of selectLongs.  This is
>> a work-around for some users to access strings through the C API,
>> which does not have a nice way to pass back strings.  The idea is that
>>  the users can get the starting positions of the selected strings and
>> read the string themselves.
>>
>>
>> On 7/7/2009 12:05 PM, Andrew Olson wrote:
>>> Thanks in advance!  The different type of integers that I was
>>> suggesting would go in the file that replaces the .sp.  Maybe the
>>> first byte of the file could indicate what type of integers are in
>>> this file?  The dictionary would remain the same.  After the
>>> dictionary is read into memory, you would need to properly cast the
>>> offsets so they could be used to fetch the corresponding string.
>>> Andrew
>>>
>>> On Jul 7, 2009, at 2:43 PM, K. John Wu wrote:
>>>
>>>> Hi, Andrew,
>>>>
>>>> You are right that FastBit should be able to remove .sp file for
>>>> categorical values.  It might be relatively easy to do this, we will
>>>> look into it in the next few weeks.
>>>>
>>>> Thanks.
>>>>
>>>> John
>>>>
>>>> PS: FastBit does not currently have the flexibility to change the  
>>>> type
>>>> of integer used for a dictionary, but we will keep it in mind when  
>>>> we
>>>> revising the software to see if it would be easy to do.
>>>>
>>>>
>>>> On 7/7/2009 10:54 AM, Andrew Olson wrote:
>>>>> Suppose the query were "select ID where ...".  In this case does it
>>>>> proceed roughly as follows?
>>>>> 1) evaluate the where clause resulting in a bitmap
>>>>> 2) use this bitmap to get the offsets from ID.sp
>>>>> 3) and get the corresponding strings from the data file ID. (during
>>>>> step 2?)
>>>>>
>>>>> It seems like the .sp files are used only when reading strings  
>>>>> out of
>>>>> the raw data file, but in the case of ibis::category columns, the
>>>>> dictionary could be used instead.  It would save the most space to
>>>>> replace the raw data file with offsets into the dictionary array
>>>>> stored as 8, 16, 32, 64 bit integers depending on cardinality.  In
>>>>> practice you probably don't want to destroy the raw data file so  
>>>>> this
>>>>> could be an auxiliary file to replace .sp.  After thinking about  
>>>>> this
>>>>> for 2 minutes, it seems like there are no performance issues with
>>>>> this
>>>>> approach.  Have you considered this in the past?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Jul 7, 2009, at 1:25 PM, K. John Wu wrote:
>>>>>
>>>>>> while evaluating the where clause, ID.idx and seqid.idx will be
>>>>>> used.
>>>>>> Since ID is a category, the file ID.dic will be read in first so
>>>>>> that  the string value "RepeatMask" will be translated to a bitmap
>>>>>> in
>>>>>> ID.idx.   The file ID and seqid might be used if the software
>>>>>> decided
>>>>>> it would be cheaper to use them.  When the file ID is used, the  
>>>>>> file
>>>>>> ID.sp will also be used to remove the need to reading all strings
>>>>>> sequentially.
>>>>>>
>>>>>> To process the select clause, files seqid and start will be used.
>>>>>> The
>>>>>> order by and group by operations are performed in memory at this
>>>>>> point.
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 7/7/2009 9:58 AM, Andrew Olson wrote:
>>>>>>> Thanks for the explanation.  One more question.  Can you outline
>>>>>>> when
>>>>>>> each data and auxiliary file (.idx, .dic, .sp) is used when
>>>>>>> evaluating
>>>>>>> a query like this:
>>>>>>>
>>>>>>> "select seqid, floor(start/1000) as bin where ID=RepeatMask and
>>>>>>> seqid
>>>>>>>> 0"
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Jul 7, 2009, at 12:34 PM, K. John Wu wrote:
>>>>>>>
>>>>>>>> Hi, Andrew,
>>>>>>>>
>>>>>>>> Yes, the .sp files are of the expected sizes.  Basically, for  
>>>>>>>> each
>>>>>>>> string valued column (including both ibis::text and
>>>>>>>> ibis::category),
>>>>>>>> the raw data file contains the strings.  The .sp file is an
>>>>>>>> auxiliary
>>>>>>>> file to store the starting positions of the strings in the raw
>>>>>>>> data
>>>>>>>> file.  Since the strings can be long, even though we restrict  
>>>>>>>> the
>>>>>>>> number of rows in a partition (i.e., in a directory) to be no  
>>>>>>>> more
>>>>>>>> than 2 billion, we have to use 64-bit positions in the .sp  
>>>>>>>> files.
>>>>>>>> Therefore, each value in the .sp file is 8-byte long.  For your
>>>>>>>> table
>>>>>>>> with 23198575 rows, there are 23198576 values in each .sp file
>>>>>>>> (one
>>>>>>>> value for each string, plus an extra one for the end of file),
>>>>>>>> which
>>>>>>>> adds up to 185588608 bytes as you've seen.
>>>>>>>>
>>>>>>>> An alternative to asking FastBit to store strings would be do  
>>>>>>>> the
>>>>>>>> conversion yourself and ask Fastbit to store integers.  For your
>>>>>>>> data
>>>>>>>> it sees to be possible to use 8-bit or 16-bit integers, which
>>>>>>>> should
>>>>>>>> save space even more..
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/7/2009 8:42 AM, Andrew Olson wrote:
>>>>>>>>> Some of my data tables have several low cardinality string
>>>>>>>>> columns of
>>>>>>>>> type "category".  The .sp files that are generated are really
>>>>>>>>> large,
>>>>>>>>> and from my reading of the documentation, unnecessary.  Is this
>>>>>>>>> what
>>>>>>>>> is expected?
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> bash-3.2$ ls -la
>>>>>>>>> total 3599152
>>>>>>>>> -rw-r--r--   1 olson  staff        543 Jul  7 11:28 -part.txt
>>>>>>>>> drwxr-xr-x  25 olson  staff        850 Jul  7 11:32 .
>>>>>>>>> drwxr-xr-x   4 olson  staff        136 Jul  7 09:15 ..
>>>>>>>>> -rw-r--r--   1 olson  staff  181535172 Jul  7 09:15 ID
>>>>>>>>> -rw-r--r--   1 olson  staff         62 Jul  7 11:29 ID.dic
>>>>>>>>> -rw-r--r--   1 olson  staff    2208968 Jul  7 11:29 ID.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 ID.sp
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 end
>>>>>>>>> -rw-r--r--   1 olson  staff   46397150 Jul  7 09:15 phase
>>>>>>>>> -rw-r--r--   1 olson  staff          2 Jul  7 11:29 phase.dic
>>>>>>>>> -rw-r--r--   1 olson  staff         76 Jul  7 11:29 phase.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 phase.sp
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 score
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 seqid
>>>>>>>>> -rw-r--r--   1 olson  staff   42309008 Jul  7 11:29 seqid.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  255184325 Jul  7 09:15 source
>>>>>>>>> -rw-r--r--   1 olson  staff         11 Jul  7 11:29 source.dic
>>>>>>>>> -rw-r--r--   1 olson  staff         76 Jul  7 11:29 source.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 source.sp
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 start
>>>>>>>>> -rw-r--r--   1 olson  staff  149670216 Jul  7 11:32 start.idx
>>>>>>>>> -rw-r--r--   1 olson  staff   49362151 Jul  7 09:15 strand
>>>>>>>>> -rw-r--r--   1 olson  staff          7 Jul  7 11:29 strand.dic
>>>>>>>>> -rw-r--r--   1 olson  staff    2511064 Jul  7 11:29 strand.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 strand.sp
>>>>>>>>> bash-3.2$
>>>>>>>>>
>>>>>>>>> And here is the content of my -part.txt file
>>>>>>>>> BEGIN HEADER
>>>>>>>>> dataset.Name=repeat_region
>>>>>>>>> Number_of_rows=23198575
>>>>>>>>> Number_of_columns=8
>>>>>>>>> Table_State=1
>>>>>>>>> END HEADER
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=source
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=ID
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=score
>>>>>>>>> data_type=Float
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=seqid
>>>>>>>>> data_type=Unsigned
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=end
>>>>>>>>> data_type=Unsigned
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=phase
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=strand
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=start
>>>>>>>>> data_type=Unsigned
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> FastBit-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>>>> _______________________________________________
>>>>>>>> FastBit-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>>> _______________________________________________
>>>>>>> FastBit-users mailing list
>>>>>>> [email protected]
>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>> _______________________________________________
>>>>>> FastBit-users mailing list
>>>>>> [email protected]
>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>> _______________________________________________
>>>>> FastBit-users mailing list
>>>>> [email protected]
>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] huge .sp files

Reply via email to