Re: [FastBit-users] huge .sp files

Andrew Olson Thu, 09 Jul 2009 07:13:24 -0700

Hi John,

In that case, can you help me understand how to do natural joins?  I  
created a lookup table to replace seqname with seqid.  This was  
necessary to get groupby(seqid) to work.  So an example SQL query that  
I would like to run could be something like this:


select lut.seqname, data.start, data.end from lut, data where  
lut.seqid = data.seqid and data.score > 1000;

Andrew

On Jul 8, 2009, at 9:00 PM, K. John Wu wrote:

> Hi, Andrew,
>
> I was looking at the possibility of removing .sp files for categorical
> values.  It turned out that I have promised some user certain
> functionality that requires this file a couple of months ago.
> Therefore, it would not be possible to remove .sp files without also
> taking back that functionality.  The easier thing would be for us to
> keep .sp files around.  Sorry for jumping the gun on this one..
>
> In light of this, your best bet might be to convert your categorical
> values into integers yourself and ask FastBit to manage the integers.
>
> John
>
> PS: The particular function is called ibis::column::selectLongs, it is
> specialized for ibis::text and ibis::category to return the starting
> positions of the strings selected.  The .sp files contains these
> starting positions needed for these versions of selectLongs.  This is
> a work-around for some users to access strings through the C API,
> which does not have a nice way to pass back strings.  The idea is that
>  the users can get the starting positions of the selected strings and
> read the string themselves.
>
>
> On 7/7/2009 12:05 PM, Andrew Olson wrote:
>> Thanks in advance!  The different type of integers that I was
>> suggesting would go in the file that replaces the .sp.  Maybe the
>> first byte of the file could indicate what type of integers are in
>> this file?  The dictionary would remain the same.  After the
>> dictionary is read into memory, you would need to properly cast the
>> offsets so they could be used to fetch the corresponding string.
>> Andrew
>>
>> On Jul 7, 2009, at 2:43 PM, K. John Wu wrote:
>>
>>> Hi, Andrew,
>>>
>>> You are right that FastBit should be able to remove .sp file for
>>> categorical values.  It might be relatively easy to do this, we will
>>> look into it in the next few weeks.
>>>
>>> Thanks.
>>>
>>> John
>>>
>>> PS: FastBit does not currently have the flexibility to change the  
>>> type
>>> of integer used for a dictionary, but we will keep it in mind when  
>>> we
>>> revising the software to see if it would be easy to do.
>>>
>>>
>>> On 7/7/2009 10:54 AM, Andrew Olson wrote:
>>>> Suppose the query were "select ID where ...".  In this case does it
>>>> proceed roughly as follows?
>>>> 1) evaluate the where clause resulting in a bitmap
>>>> 2) use this bitmap to get the offsets from ID.sp
>>>> 3) and get the corresponding strings from the data file ID. (during
>>>> step 2?)
>>>>
>>>> It seems like the .sp files are used only when reading strings  
>>>> out of
>>>> the raw data file, but in the case of ibis::category columns, the
>>>> dictionary could be used instead.  It would save the most space to
>>>> replace the raw data file with offsets into the dictionary array
>>>> stored as 8, 16, 32, 64 bit integers depending on cardinality.  In
>>>> practice you probably don't want to destroy the raw data file so  
>>>> this
>>>> could be an auxiliary file to replace .sp.  After thinking about  
>>>> this
>>>> for 2 minutes, it seems like there are no performance issues with
>>>> this
>>>> approach.  Have you considered this in the past?
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Jul 7, 2009, at 1:25 PM, K. John Wu wrote:
>>>>
>>>>> while evaluating the where clause, ID.idx and seqid.idx will be
>>>>> used.
>>>>> Since ID is a category, the file ID.dic will be read in first so
>>>>> that  the string value "RepeatMask" will be translated to a bitmap
>>>>> in
>>>>> ID.idx.   The file ID and seqid might be used if the software
>>>>> decided
>>>>> it would be cheaper to use them.  When the file ID is used, the  
>>>>> file
>>>>> ID.sp will also be used to remove the need to reading all strings
>>>>> sequentially.
>>>>>
>>>>> To process the select clause, files seqid and start will be used.
>>>>> The
>>>>> order by and group by operations are performed in memory at this
>>>>> point.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> On 7/7/2009 9:58 AM, Andrew Olson wrote:
>>>>>> Thanks for the explanation.  One more question.  Can you outline
>>>>>> when
>>>>>> each data and auxiliary file (.idx, .dic, .sp) is used when
>>>>>> evaluating
>>>>>> a query like this:
>>>>>>
>>>>>> "select seqid, floor(start/1000) as bin where ID=RepeatMask and
>>>>>> seqid
>>>>>>> 0"
>>>>>> Andrew
>>>>>>
>>>>>> On Jul 7, 2009, at 12:34 PM, K. John Wu wrote:
>>>>>>
>>>>>>> Hi, Andrew,
>>>>>>>
>>>>>>> Yes, the .sp files are of the expected sizes.  Basically, for  
>>>>>>> each
>>>>>>> string valued column (including both ibis::text and
>>>>>>> ibis::category),
>>>>>>> the raw data file contains the strings.  The .sp file is an
>>>>>>> auxiliary
>>>>>>> file to store the starting positions of the strings in the raw
>>>>>>> data
>>>>>>> file.  Since the strings can be long, even though we restrict  
>>>>>>> the
>>>>>>> number of rows in a partition (i.e., in a directory) to be no  
>>>>>>> more
>>>>>>> than 2 billion, we have to use 64-bit positions in the .sp  
>>>>>>> files.
>>>>>>> Therefore, each value in the .sp file is 8-byte long.  For your
>>>>>>> table
>>>>>>> with 23198575 rows, there are 23198576 values in each .sp file
>>>>>>> (one
>>>>>>> value for each string, plus an extra one for the end of file),
>>>>>>> which
>>>>>>> adds up to 185588608 bytes as you've seen.
>>>>>>>
>>>>>>> An alternative to asking FastBit to store strings would be do  
>>>>>>> the
>>>>>>> conversion yourself and ask Fastbit to store integers.  For your
>>>>>>> data
>>>>>>> it sees to be possible to use 8-bit or 16-bit integers, which
>>>>>>> should
>>>>>>> save space even more..
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 7/7/2009 8:42 AM, Andrew Olson wrote:
>>>>>>>> Some of my data tables have several low cardinality string
>>>>>>>> columns of
>>>>>>>> type "category".  The .sp files that are generated are really
>>>>>>>> large,
>>>>>>>> and from my reading of the documentation, unnecessary.  Is this
>>>>>>>> what
>>>>>>>> is expected?
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>> bash-3.2$ ls -la
>>>>>>>> total 3599152
>>>>>>>> -rw-r--r--   1 olson  staff        543 Jul  7 11:28 -part.txt
>>>>>>>> drwxr-xr-x  25 olson  staff        850 Jul  7 11:32 .
>>>>>>>> drwxr-xr-x   4 olson  staff        136 Jul  7 09:15 ..
>>>>>>>> -rw-r--r--   1 olson  staff  181535172 Jul  7 09:15 ID
>>>>>>>> -rw-r--r--   1 olson  staff         62 Jul  7 11:29 ID.dic
>>>>>>>> -rw-r--r--   1 olson  staff    2208968 Jul  7 11:29 ID.idx
>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 ID.sp
>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 end
>>>>>>>> -rw-r--r--   1 olson  staff   46397150 Jul  7 09:15 phase
>>>>>>>> -rw-r--r--   1 olson  staff          2 Jul  7 11:29 phase.dic
>>>>>>>> -rw-r--r--   1 olson  staff         76 Jul  7 11:29 phase.idx
>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 phase.sp
>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 score
>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 seqid
>>>>>>>> -rw-r--r--   1 olson  staff   42309008 Jul  7 11:29 seqid.idx
>>>>>>>> -rw-r--r--   1 olson  staff  255184325 Jul  7 09:15 source
>>>>>>>> -rw-r--r--   1 olson  staff         11 Jul  7 11:29 source.dic
>>>>>>>> -rw-r--r--   1 olson  staff         76 Jul  7 11:29 source.idx
>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 source.sp
>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 start
>>>>>>>> -rw-r--r--   1 olson  staff  149670216 Jul  7 11:32 start.idx
>>>>>>>> -rw-r--r--   1 olson  staff   49362151 Jul  7 09:15 strand
>>>>>>>> -rw-r--r--   1 olson  staff          7 Jul  7 11:29 strand.dic
>>>>>>>> -rw-r--r--   1 olson  staff    2511064 Jul  7 11:29 strand.idx
>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 strand.sp
>>>>>>>> bash-3.2$
>>>>>>>>
>>>>>>>> And here is the content of my -part.txt file
>>>>>>>> BEGIN HEADER
>>>>>>>> dataset.Name=repeat_region
>>>>>>>> Number_of_rows=23198575
>>>>>>>> Number_of_columns=8
>>>>>>>> Table_State=1
>>>>>>>> END HEADER
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=source
>>>>>>>> data_type=Category
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=ID
>>>>>>>> data_type=Category
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=score
>>>>>>>> data_type=Float
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=seqid
>>>>>>>> data_type=Unsigned
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=end
>>>>>>>> data_type=Unsigned
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=phase
>>>>>>>> data_type=Category
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=strand
>>>>>>>> data_type=Category
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> BEGIN Column
>>>>>>>> name=start
>>>>>>>> data_type=Unsigned
>>>>>>>> END Column
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> FastBit-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>>> _______________________________________________
>>>>>>> FastBit-users mailing list
>>>>>>> [email protected]
>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>> _______________________________________________
>>>>>> FastBit-users mailing list
>>>>>> [email protected]
>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>> _______________________________________________
>>>>> FastBit-users mailing list
>>>>> [email protected]
>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] huge .sp files

Reply via email to