Hi, Andrew, Instead of having another "table", I would encourage to try ibis::dictionary to hold the "lookup table". In the C++ program that performs SQL query, you will be basically using one ibis::table called data and execute a call to function select with select clause "seqid, start, end" and where clause "score > 1000". You will get back an in-memory table with three columns: seqid, start, and end. When print out the tuples out, instead of printing the value of seqid, use the dictionary to lookup the string representation.
The tricky part is that ibis::dictionary maps normal strings from 1 up with 0 reserved for null strings (and treated separately). If you have mapped strings from 1 up, you can write the dictionary file as follows. The string assigned integer 1 appears first in the dictionary file with the null terminator, the string assigned value 2 appears next, and so on. To write the null terminator, open the output in binary mode, for a string with nchar = strlen(...) bytes, write nchar+1 bytes (instead of nchar). John PS: As I am writing this message, I am starting to question the wisdom of exporting file offsets to the end user -- the rationale for that was to make it possible for end users to access strings. It does not seem to be a particularly bright idea to start with. This might be a good opportunity to get rid of it... On 7/9/2009 7:08 AM, Andrew Olson wrote: > Hi John, > > In that case, can you help me understand how to do natural joins? I > created a lookup table to replace seqname with seqid. This was > necessary to get groupby(seqid) to work. So an example SQL query that > I would like to run could be something like this: > > select lut.seqname, data.start, data.end from lut, data where > lut.seqid = data.seqid and data.score > 1000; > > Andrew > > On Jul 8, 2009, at 9:00 PM, K. John Wu wrote: > >> Hi, Andrew, >> >> I was looking at the possibility of removing .sp files for categorical >> values. It turned out that I have promised some user certain >> functionality that requires this file a couple of months ago. >> Therefore, it would not be possible to remove .sp files without also >> taking back that functionality. The easier thing would be for us to >> keep .sp files around. Sorry for jumping the gun on this one.. >> >> In light of this, your best bet might be to convert your categorical >> values into integers yourself and ask FastBit to manage the integers. >> >> John >> >> PS: The particular function is called ibis::column::selectLongs, it is >> specialized for ibis::text and ibis::category to return the starting >> positions of the strings selected. The .sp files contains these >> starting positions needed for these versions of selectLongs. This is >> a work-around for some users to access strings through the C API, >> which does not have a nice way to pass back strings. The idea is that >> the users can get the starting positions of the selected strings and >> read the string themselves. >> >> >> On 7/7/2009 12:05 PM, Andrew Olson wrote: >>> Thanks in advance! The different type of integers that I was >>> suggesting would go in the file that replaces the .sp. Maybe the >>> first byte of the file could indicate what type of integers are in >>> this file? The dictionary would remain the same. After the >>> dictionary is read into memory, you would need to properly cast the >>> offsets so they could be used to fetch the corresponding string. >>> Andrew >>> >>> On Jul 7, 2009, at 2:43 PM, K. John Wu wrote: >>> >>>> Hi, Andrew, >>>> >>>> You are right that FastBit should be able to remove .sp file for >>>> categorical values. It might be relatively easy to do this, we will >>>> look into it in the next few weeks. >>>> >>>> Thanks. >>>> >>>> John >>>> >>>> PS: FastBit does not currently have the flexibility to change the >>>> type >>>> of integer used for a dictionary, but we will keep it in mind when >>>> we >>>> revising the software to see if it would be easy to do. >>>> >>>> >>>> On 7/7/2009 10:54 AM, Andrew Olson wrote: >>>>> Suppose the query were "select ID where ...". In this case does it >>>>> proceed roughly as follows? >>>>> 1) evaluate the where clause resulting in a bitmap >>>>> 2) use this bitmap to get the offsets from ID.sp >>>>> 3) and get the corresponding strings from the data file ID. (during >>>>> step 2?) >>>>> >>>>> It seems like the .sp files are used only when reading strings >>>>> out of >>>>> the raw data file, but in the case of ibis::category columns, the >>>>> dictionary could be used instead. It would save the most space to >>>>> replace the raw data file with offsets into the dictionary array >>>>> stored as 8, 16, 32, 64 bit integers depending on cardinality. In >>>>> practice you probably don't want to destroy the raw data file so >>>>> this >>>>> could be an auxiliary file to replace .sp. After thinking about >>>>> this >>>>> for 2 minutes, it seems like there are no performance issues with >>>>> this >>>>> approach. Have you considered this in the past? >>>>> >>>>> Andrew >>>>> >>>>> >>>>> On Jul 7, 2009, at 1:25 PM, K. John Wu wrote: >>>>> >>>>>> while evaluating the where clause, ID.idx and seqid.idx will be >>>>>> used. >>>>>> Since ID is a category, the file ID.dic will be read in first so >>>>>> that the string value "RepeatMask" will be translated to a bitmap >>>>>> in >>>>>> ID.idx. The file ID and seqid might be used if the software >>>>>> decided >>>>>> it would be cheaper to use them. When the file ID is used, the >>>>>> file >>>>>> ID.sp will also be used to remove the need to reading all strings >>>>>> sequentially. >>>>>> >>>>>> To process the select clause, files seqid and start will be used. >>>>>> The >>>>>> order by and group by operations are performed in memory at this >>>>>> point. >>>>>> >>>>>> John >>>>>> >>>>>> >>>>>> On 7/7/2009 9:58 AM, Andrew Olson wrote: >>>>>>> Thanks for the explanation. One more question. Can you outline >>>>>>> when >>>>>>> each data and auxiliary file (.idx, .dic, .sp) is used when >>>>>>> evaluating >>>>>>> a query like this: >>>>>>> >>>>>>> "select seqid, floor(start/1000) as bin where ID=RepeatMask and >>>>>>> seqid >>>>>>>> 0" >>>>>>> Andrew >>>>>>> >>>>>>> On Jul 7, 2009, at 12:34 PM, K. John Wu wrote: >>>>>>> >>>>>>>> Hi, Andrew, >>>>>>>> >>>>>>>> Yes, the .sp files are of the expected sizes. Basically, for >>>>>>>> each >>>>>>>> string valued column (including both ibis::text and >>>>>>>> ibis::category), >>>>>>>> the raw data file contains the strings. The .sp file is an >>>>>>>> auxiliary >>>>>>>> file to store the starting positions of the strings in the raw >>>>>>>> data >>>>>>>> file. Since the strings can be long, even though we restrict >>>>>>>> the >>>>>>>> number of rows in a partition (i.e., in a directory) to be no >>>>>>>> more >>>>>>>> than 2 billion, we have to use 64-bit positions in the .sp >>>>>>>> files. >>>>>>>> Therefore, each value in the .sp file is 8-byte long. For your >>>>>>>> table >>>>>>>> with 23198575 rows, there are 23198576 values in each .sp file >>>>>>>> (one >>>>>>>> value for each string, plus an extra one for the end of file), >>>>>>>> which >>>>>>>> adds up to 185588608 bytes as you've seen. >>>>>>>> >>>>>>>> An alternative to asking FastBit to store strings would be do >>>>>>>> the >>>>>>>> conversion yourself and ask Fastbit to store integers. For your >>>>>>>> data >>>>>>>> it sees to be possible to use 8-bit or 16-bit integers, which >>>>>>>> should >>>>>>>> save space even more.. >>>>>>>> >>>>>>>> John >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 7/7/2009 8:42 AM, Andrew Olson wrote: >>>>>>>>> Some of my data tables have several low cardinality string >>>>>>>>> columns of >>>>>>>>> type "category". The .sp files that are generated are really >>>>>>>>> large, >>>>>>>>> and from my reading of the documentation, unnecessary. Is this >>>>>>>>> what >>>>>>>>> is expected? >>>>>>>>> >>>>>>>>> Andrew >>>>>>>>> >>>>>>>>> bash-3.2$ ls -la >>>>>>>>> total 3599152 >>>>>>>>> -rw-r--r-- 1 olson staff 543 Jul 7 11:28 -part.txt >>>>>>>>> drwxr-xr-x 25 olson staff 850 Jul 7 11:32 . >>>>>>>>> drwxr-xr-x 4 olson staff 136 Jul 7 09:15 .. >>>>>>>>> -rw-r--r-- 1 olson staff 181535172 Jul 7 09:15 ID >>>>>>>>> -rw-r--r-- 1 olson staff 62 Jul 7 11:29 ID.dic >>>>>>>>> -rw-r--r-- 1 olson staff 2208968 Jul 7 11:29 ID.idx >>>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 ID.sp >>>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 end >>>>>>>>> -rw-r--r-- 1 olson staff 46397150 Jul 7 09:15 phase >>>>>>>>> -rw-r--r-- 1 olson staff 2 Jul 7 11:29 phase.dic >>>>>>>>> -rw-r--r-- 1 olson staff 76 Jul 7 11:29 phase.idx >>>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 phase.sp >>>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 score >>>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 seqid >>>>>>>>> -rw-r--r-- 1 olson staff 42309008 Jul 7 11:29 seqid.idx >>>>>>>>> -rw-r--r-- 1 olson staff 255184325 Jul 7 09:15 source >>>>>>>>> -rw-r--r-- 1 olson staff 11 Jul 7 11:29 source.dic >>>>>>>>> -rw-r--r-- 1 olson staff 76 Jul 7 11:29 source.idx >>>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 source.sp >>>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 start >>>>>>>>> -rw-r--r-- 1 olson staff 149670216 Jul 7 11:32 start.idx >>>>>>>>> -rw-r--r-- 1 olson staff 49362151 Jul 7 09:15 strand >>>>>>>>> -rw-r--r-- 1 olson staff 7 Jul 7 11:29 strand.dic >>>>>>>>> -rw-r--r-- 1 olson staff 2511064 Jul 7 11:29 strand.idx >>>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 strand.sp >>>>>>>>> bash-3.2$ >>>>>>>>> >>>>>>>>> And here is the content of my -part.txt file >>>>>>>>> BEGIN HEADER >>>>>>>>> dataset.Name=repeat_region >>>>>>>>> Number_of_rows=23198575 >>>>>>>>> Number_of_columns=8 >>>>>>>>> Table_State=1 >>>>>>>>> END HEADER >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=source >>>>>>>>> data_type=Category >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=ID >>>>>>>>> data_type=Category >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=score >>>>>>>>> data_type=Float >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=seqid >>>>>>>>> data_type=Unsigned >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=end >>>>>>>>> data_type=Unsigned >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=phase >>>>>>>>> data_type=Category >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=strand >>>>>>>>> data_type=Category >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> BEGIN Column >>>>>>>>> name=start >>>>>>>>> data_type=Unsigned >>>>>>>>> END Column >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> FastBit-users mailing list >>>>>>>>> [email protected] >>>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>>>>> _______________________________________________ >>>>>>>> FastBit-users mailing list >>>>>>>> [email protected] >>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>>>> _______________________________________________ >>>>>>> FastBit-users mailing list >>>>>>> [email protected] >>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>>> _______________________________________________ >>>>>> FastBit-users mailing list >>>>>> [email protected] >>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>> _______________________________________________ >>>>> FastBit-users mailing list >>>>> [email protected] >>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>> _______________________________________________ >>>> FastBit-users mailing list >>>> [email protected] >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
