Re: [FastBit-users] huge .sp files

K. John Wu Fri, 10 Jul 2009 08:55:15 -0700

Hi, Andrew,

The better solution in your case would be to enable the group by 
operations to perform correctly on string values columns.  I have a 
preliminary implementation checked into the SVN repository.  Please 
give it a try when you get the chance and let us know if you spot any 
problems.  Thanks.


John


PS:  You probably have to wait till a nightly snapshot is made (which 
seems to happen daily around 3AM Pacific Time in the US) by the system 
in order for you to get hold of the new code.  You can get the nightly 
snapshot at <https://codeforge.lbl.gov/snapshots.php?group_id=44>.



On 7/9/2009 7:08 AM, Andrew Olson wrote:
> Hi John,
> 
> In that case, can you help me understand how to do natural joins?  I  
> created a lookup table to replace seqname with seqid.  This was  
> necessary to get groupby(seqid) to work.  So an example SQL query that  
> I would like to run could be something like this:
> 
> select lut.seqname, data.start, data.end from lut, data where  
> lut.seqid = data.seqid and data.score > 1000;
> 
> Andrew
> 
> On Jul 8, 2009, at 9:00 PM, K. John Wu wrote:
> 
>> Hi, Andrew,
>>
>> I was looking at the possibility of removing .sp files for categorical
>> values.  It turned out that I have promised some user certain
>> functionality that requires this file a couple of months ago.
>> Therefore, it would not be possible to remove .sp files without also
>> taking back that functionality.  The easier thing would be for us to
>> keep .sp files around.  Sorry for jumping the gun on this one..
>>
>> In light of this, your best bet might be to convert your categorical
>> values into integers yourself and ask FastBit to manage the integers.
>>
>> John
>>
>> PS: The particular function is called ibis::column::selectLongs, it is
>> specialized for ibis::text and ibis::category to return the starting
>> positions of the strings selected.  The .sp files contains these
>> starting positions needed for these versions of selectLongs.  This is
>> a work-around for some users to access strings through the C API,
>> which does not have a nice way to pass back strings.  The idea is that
>>  the users can get the starting positions of the selected strings and
>> read the string themselves.
>>
>>
>> On 7/7/2009 12:05 PM, Andrew Olson wrote:
>>> Thanks in advance!  The different type of integers that I was
>>> suggesting would go in the file that replaces the .sp.  Maybe the
>>> first byte of the file could indicate what type of integers are in
>>> this file?  The dictionary would remain the same.  After the
>>> dictionary is read into memory, you would need to properly cast the
>>> offsets so they could be used to fetch the corresponding string.
>>> Andrew
>>>
>>> On Jul 7, 2009, at 2:43 PM, K. John Wu wrote:
>>>
>>>> Hi, Andrew,
>>>>
>>>> You are right that FastBit should be able to remove .sp file for
>>>> categorical values.  It might be relatively easy to do this, we will
>>>> look into it in the next few weeks.
>>>>
>>>> Thanks.
>>>>
>>>> John
>>>>
>>>> PS: FastBit does not currently have the flexibility to change the  
>>>> type
>>>> of integer used for a dictionary, but we will keep it in mind when  
>>>> we
>>>> revising the software to see if it would be easy to do.
>>>>
>>>>
>>>> On 7/7/2009 10:54 AM, Andrew Olson wrote:
>>>>> Suppose the query were "select ID where ...".  In this case does it
>>>>> proceed roughly as follows?
>>>>> 1) evaluate the where clause resulting in a bitmap
>>>>> 2) use this bitmap to get the offsets from ID.sp
>>>>> 3) and get the corresponding strings from the data file ID. (during
>>>>> step 2?)
>>>>>
>>>>> It seems like the .sp files are used only when reading strings  
>>>>> out of
>>>>> the raw data file, but in the case of ibis::category columns, the
>>>>> dictionary could be used instead.  It would save the most space to
>>>>> replace the raw data file with offsets into the dictionary array
>>>>> stored as 8, 16, 32, 64 bit integers depending on cardinality.  In
>>>>> practice you probably don't want to destroy the raw data file so  
>>>>> this
>>>>> could be an auxiliary file to replace .sp.  After thinking about  
>>>>> this
>>>>> for 2 minutes, it seems like there are no performance issues with
>>>>> this
>>>>> approach.  Have you considered this in the past?
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Jul 7, 2009, at 1:25 PM, K. John Wu wrote:
>>>>>
>>>>>> while evaluating the where clause, ID.idx and seqid.idx will be
>>>>>> used.
>>>>>> Since ID is a category, the file ID.dic will be read in first so
>>>>>> that  the string value "RepeatMask" will be translated to a bitmap
>>>>>> in
>>>>>> ID.idx.   The file ID and seqid might be used if the software
>>>>>> decided
>>>>>> it would be cheaper to use them.  When the file ID is used, the  
>>>>>> file
>>>>>> ID.sp will also be used to remove the need to reading all strings
>>>>>> sequentially.
>>>>>>
>>>>>> To process the select clause, files seqid and start will be used.
>>>>>> The
>>>>>> order by and group by operations are performed in memory at this
>>>>>> point.
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> On 7/7/2009 9:58 AM, Andrew Olson wrote:
>>>>>>> Thanks for the explanation.  One more question.  Can you outline
>>>>>>> when
>>>>>>> each data and auxiliary file (.idx, .dic, .sp) is used when
>>>>>>> evaluating
>>>>>>> a query like this:
>>>>>>>
>>>>>>> "select seqid, floor(start/1000) as bin where ID=RepeatMask and
>>>>>>> seqid
>>>>>>>> 0"
>>>>>>> Andrew
>>>>>>>
>>>>>>> On Jul 7, 2009, at 12:34 PM, K. John Wu wrote:
>>>>>>>
>>>>>>>> Hi, Andrew,
>>>>>>>>
>>>>>>>> Yes, the .sp files are of the expected sizes.  Basically, for  
>>>>>>>> each
>>>>>>>> string valued column (including both ibis::text and
>>>>>>>> ibis::category),
>>>>>>>> the raw data file contains the strings.  The .sp file is an
>>>>>>>> auxiliary
>>>>>>>> file to store the starting positions of the strings in the raw
>>>>>>>> data
>>>>>>>> file.  Since the strings can be long, even though we restrict  
>>>>>>>> the
>>>>>>>> number of rows in a partition (i.e., in a directory) to be no  
>>>>>>>> more
>>>>>>>> than 2 billion, we have to use 64-bit positions in the .sp  
>>>>>>>> files.
>>>>>>>> Therefore, each value in the .sp file is 8-byte long.  For your
>>>>>>>> table
>>>>>>>> with 23198575 rows, there are 23198576 values in each .sp file
>>>>>>>> (one
>>>>>>>> value for each string, plus an extra one for the end of file),
>>>>>>>> which
>>>>>>>> adds up to 185588608 bytes as you've seen.
>>>>>>>>
>>>>>>>> An alternative to asking FastBit to store strings would be do  
>>>>>>>> the
>>>>>>>> conversion yourself and ask Fastbit to store integers.  For your
>>>>>>>> data
>>>>>>>> it sees to be possible to use 8-bit or 16-bit integers, which
>>>>>>>> should
>>>>>>>> save space even more..
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/7/2009 8:42 AM, Andrew Olson wrote:
>>>>>>>>> Some of my data tables have several low cardinality string
>>>>>>>>> columns of
>>>>>>>>> type "category".  The .sp files that are generated are really
>>>>>>>>> large,
>>>>>>>>> and from my reading of the documentation, unnecessary.  Is this
>>>>>>>>> what
>>>>>>>>> is expected?
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>> bash-3.2$ ls -la
>>>>>>>>> total 3599152
>>>>>>>>> -rw-r--r--   1 olson  staff        543 Jul  7 11:28 -part.txt
>>>>>>>>> drwxr-xr-x  25 olson  staff        850 Jul  7 11:32 .
>>>>>>>>> drwxr-xr-x   4 olson  staff        136 Jul  7 09:15 ..
>>>>>>>>> -rw-r--r--   1 olson  staff  181535172 Jul  7 09:15 ID
>>>>>>>>> -rw-r--r--   1 olson  staff         62 Jul  7 11:29 ID.dic
>>>>>>>>> -rw-r--r--   1 olson  staff    2208968 Jul  7 11:29 ID.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 ID.sp
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 end
>>>>>>>>> -rw-r--r--   1 olson  staff   46397150 Jul  7 09:15 phase
>>>>>>>>> -rw-r--r--   1 olson  staff          2 Jul  7 11:29 phase.dic
>>>>>>>>> -rw-r--r--   1 olson  staff         76 Jul  7 11:29 phase.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 phase.sp
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 score
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 seqid
>>>>>>>>> -rw-r--r--   1 olson  staff   42309008 Jul  7 11:29 seqid.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  255184325 Jul  7 09:15 source
>>>>>>>>> -rw-r--r--   1 olson  staff         11 Jul  7 11:29 source.dic
>>>>>>>>> -rw-r--r--   1 olson  staff         76 Jul  7 11:29 source.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 source.sp
>>>>>>>>> -rw-r--r--   1 olson  staff   92794300 Jul  7 09:15 start
>>>>>>>>> -rw-r--r--   1 olson  staff  149670216 Jul  7 11:32 start.idx
>>>>>>>>> -rw-r--r--   1 olson  staff   49362151 Jul  7 09:15 strand
>>>>>>>>> -rw-r--r--   1 olson  staff          7 Jul  7 11:29 strand.dic
>>>>>>>>> -rw-r--r--   1 olson  staff    2511064 Jul  7 11:29 strand.idx
>>>>>>>>> -rw-r--r--   1 olson  staff  185588608 Jul  7 11:29 strand.sp
>>>>>>>>> bash-3.2$
>>>>>>>>>
>>>>>>>>> And here is the content of my -part.txt file
>>>>>>>>> BEGIN HEADER
>>>>>>>>> dataset.Name=repeat_region
>>>>>>>>> Number_of_rows=23198575
>>>>>>>>> Number_of_columns=8
>>>>>>>>> Table_State=1
>>>>>>>>> END HEADER
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=source
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=ID
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=score
>>>>>>>>> data_type=Float
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=seqid
>>>>>>>>> data_type=Unsigned
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=end
>>>>>>>>> data_type=Unsigned
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=phase
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=strand
>>>>>>>>> data_type=Category
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> BEGIN Column
>>>>>>>>> name=start
>>>>>>>>> data_type=Unsigned
>>>>>>>>> END Column
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> FastBit-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>>>> _______________________________________________
>>>>>>>> FastBit-users mailing list
>>>>>>>> [email protected]
>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>>> _______________________________________________
>>>>>>> FastBit-users mailing list
>>>>>>> [email protected]
>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>>> _______________________________________________
>>>>>> FastBit-users mailing list
>>>>>> [email protected]
>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>>> _______________________________________________
>>>>> FastBit-users mailing list
>>>>> [email protected]
>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>>> _______________________________________________
>>>> FastBit-users mailing list
>>>> [email protected]
>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>> _______________________________________________
>>> FastBit-users mailing list
>>> [email protected]
>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>> _______________________________________________
>> FastBit-users mailing list
>> [email protected]
>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] huge .sp files

Reply via email to