Hi John, In that case, can you help me understand how to do natural joins? I created a lookup table to replace seqname with seqid. This was necessary to get groupby(seqid) to work. So an example SQL query that I would like to run could be something like this:
select lut.seqname, data.start, data.end from lut, data where lut.seqid = data.seqid and data.score > 1000; Andrew On Jul 8, 2009, at 9:00 PM, K. John Wu wrote: > Hi, Andrew, > > I was looking at the possibility of removing .sp files for categorical > values. It turned out that I have promised some user certain > functionality that requires this file a couple of months ago. > Therefore, it would not be possible to remove .sp files without also > taking back that functionality. The easier thing would be for us to > keep .sp files around. Sorry for jumping the gun on this one.. > > In light of this, your best bet might be to convert your categorical > values into integers yourself and ask FastBit to manage the integers. > > John > > PS: The particular function is called ibis::column::selectLongs, it is > specialized for ibis::text and ibis::category to return the starting > positions of the strings selected. The .sp files contains these > starting positions needed for these versions of selectLongs. This is > a work-around for some users to access strings through the C API, > which does not have a nice way to pass back strings. The idea is that > the users can get the starting positions of the selected strings and > read the string themselves. > > > On 7/7/2009 12:05 PM, Andrew Olson wrote: >> Thanks in advance! The different type of integers that I was >> suggesting would go in the file that replaces the .sp. Maybe the >> first byte of the file could indicate what type of integers are in >> this file? The dictionary would remain the same. After the >> dictionary is read into memory, you would need to properly cast the >> offsets so they could be used to fetch the corresponding string. >> Andrew >> >> On Jul 7, 2009, at 2:43 PM, K. John Wu wrote: >> >>> Hi, Andrew, >>> >>> You are right that FastBit should be able to remove .sp file for >>> categorical values. It might be relatively easy to do this, we will >>> look into it in the next few weeks. >>> >>> Thanks. >>> >>> John >>> >>> PS: FastBit does not currently have the flexibility to change the >>> type >>> of integer used for a dictionary, but we will keep it in mind when >>> we >>> revising the software to see if it would be easy to do. >>> >>> >>> On 7/7/2009 10:54 AM, Andrew Olson wrote: >>>> Suppose the query were "select ID where ...". In this case does it >>>> proceed roughly as follows? >>>> 1) evaluate the where clause resulting in a bitmap >>>> 2) use this bitmap to get the offsets from ID.sp >>>> 3) and get the corresponding strings from the data file ID. (during >>>> step 2?) >>>> >>>> It seems like the .sp files are used only when reading strings >>>> out of >>>> the raw data file, but in the case of ibis::category columns, the >>>> dictionary could be used instead. It would save the most space to >>>> replace the raw data file with offsets into the dictionary array >>>> stored as 8, 16, 32, 64 bit integers depending on cardinality. In >>>> practice you probably don't want to destroy the raw data file so >>>> this >>>> could be an auxiliary file to replace .sp. After thinking about >>>> this >>>> for 2 minutes, it seems like there are no performance issues with >>>> this >>>> approach. Have you considered this in the past? >>>> >>>> Andrew >>>> >>>> >>>> On Jul 7, 2009, at 1:25 PM, K. John Wu wrote: >>>> >>>>> while evaluating the where clause, ID.idx and seqid.idx will be >>>>> used. >>>>> Since ID is a category, the file ID.dic will be read in first so >>>>> that the string value "RepeatMask" will be translated to a bitmap >>>>> in >>>>> ID.idx. The file ID and seqid might be used if the software >>>>> decided >>>>> it would be cheaper to use them. When the file ID is used, the >>>>> file >>>>> ID.sp will also be used to remove the need to reading all strings >>>>> sequentially. >>>>> >>>>> To process the select clause, files seqid and start will be used. >>>>> The >>>>> order by and group by operations are performed in memory at this >>>>> point. >>>>> >>>>> John >>>>> >>>>> >>>>> On 7/7/2009 9:58 AM, Andrew Olson wrote: >>>>>> Thanks for the explanation. One more question. Can you outline >>>>>> when >>>>>> each data and auxiliary file (.idx, .dic, .sp) is used when >>>>>> evaluating >>>>>> a query like this: >>>>>> >>>>>> "select seqid, floor(start/1000) as bin where ID=RepeatMask and >>>>>> seqid >>>>>>> 0" >>>>>> Andrew >>>>>> >>>>>> On Jul 7, 2009, at 12:34 PM, K. John Wu wrote: >>>>>> >>>>>>> Hi, Andrew, >>>>>>> >>>>>>> Yes, the .sp files are of the expected sizes. Basically, for >>>>>>> each >>>>>>> string valued column (including both ibis::text and >>>>>>> ibis::category), >>>>>>> the raw data file contains the strings. The .sp file is an >>>>>>> auxiliary >>>>>>> file to store the starting positions of the strings in the raw >>>>>>> data >>>>>>> file. Since the strings can be long, even though we restrict >>>>>>> the >>>>>>> number of rows in a partition (i.e., in a directory) to be no >>>>>>> more >>>>>>> than 2 billion, we have to use 64-bit positions in the .sp >>>>>>> files. >>>>>>> Therefore, each value in the .sp file is 8-byte long. For your >>>>>>> table >>>>>>> with 23198575 rows, there are 23198576 values in each .sp file >>>>>>> (one >>>>>>> value for each string, plus an extra one for the end of file), >>>>>>> which >>>>>>> adds up to 185588608 bytes as you've seen. >>>>>>> >>>>>>> An alternative to asking FastBit to store strings would be do >>>>>>> the >>>>>>> conversion yourself and ask Fastbit to store integers. For your >>>>>>> data >>>>>>> it sees to be possible to use 8-bit or 16-bit integers, which >>>>>>> should >>>>>>> save space even more.. >>>>>>> >>>>>>> John >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 7/7/2009 8:42 AM, Andrew Olson wrote: >>>>>>>> Some of my data tables have several low cardinality string >>>>>>>> columns of >>>>>>>> type "category". The .sp files that are generated are really >>>>>>>> large, >>>>>>>> and from my reading of the documentation, unnecessary. Is this >>>>>>>> what >>>>>>>> is expected? >>>>>>>> >>>>>>>> Andrew >>>>>>>> >>>>>>>> bash-3.2$ ls -la >>>>>>>> total 3599152 >>>>>>>> -rw-r--r-- 1 olson staff 543 Jul 7 11:28 -part.txt >>>>>>>> drwxr-xr-x 25 olson staff 850 Jul 7 11:32 . >>>>>>>> drwxr-xr-x 4 olson staff 136 Jul 7 09:15 .. >>>>>>>> -rw-r--r-- 1 olson staff 181535172 Jul 7 09:15 ID >>>>>>>> -rw-r--r-- 1 olson staff 62 Jul 7 11:29 ID.dic >>>>>>>> -rw-r--r-- 1 olson staff 2208968 Jul 7 11:29 ID.idx >>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 ID.sp >>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 end >>>>>>>> -rw-r--r-- 1 olson staff 46397150 Jul 7 09:15 phase >>>>>>>> -rw-r--r-- 1 olson staff 2 Jul 7 11:29 phase.dic >>>>>>>> -rw-r--r-- 1 olson staff 76 Jul 7 11:29 phase.idx >>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 phase.sp >>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 score >>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 seqid >>>>>>>> -rw-r--r-- 1 olson staff 42309008 Jul 7 11:29 seqid.idx >>>>>>>> -rw-r--r-- 1 olson staff 255184325 Jul 7 09:15 source >>>>>>>> -rw-r--r-- 1 olson staff 11 Jul 7 11:29 source.dic >>>>>>>> -rw-r--r-- 1 olson staff 76 Jul 7 11:29 source.idx >>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 source.sp >>>>>>>> -rw-r--r-- 1 olson staff 92794300 Jul 7 09:15 start >>>>>>>> -rw-r--r-- 1 olson staff 149670216 Jul 7 11:32 start.idx >>>>>>>> -rw-r--r-- 1 olson staff 49362151 Jul 7 09:15 strand >>>>>>>> -rw-r--r-- 1 olson staff 7 Jul 7 11:29 strand.dic >>>>>>>> -rw-r--r-- 1 olson staff 2511064 Jul 7 11:29 strand.idx >>>>>>>> -rw-r--r-- 1 olson staff 185588608 Jul 7 11:29 strand.sp >>>>>>>> bash-3.2$ >>>>>>>> >>>>>>>> And here is the content of my -part.txt file >>>>>>>> BEGIN HEADER >>>>>>>> dataset.Name=repeat_region >>>>>>>> Number_of_rows=23198575 >>>>>>>> Number_of_columns=8 >>>>>>>> Table_State=1 >>>>>>>> END HEADER >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=source >>>>>>>> data_type=Category >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=ID >>>>>>>> data_type=Category >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=score >>>>>>>> data_type=Float >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=seqid >>>>>>>> data_type=Unsigned >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=end >>>>>>>> data_type=Unsigned >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=phase >>>>>>>> data_type=Category >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=strand >>>>>>>> data_type=Category >>>>>>>> END Column >>>>>>>> >>>>>>>> BEGIN Column >>>>>>>> name=start >>>>>>>> data_type=Unsigned >>>>>>>> END Column >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> FastBit-users mailing list >>>>>>>> [email protected] >>>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>>>> _______________________________________________ >>>>>>> FastBit-users mailing list >>>>>>> [email protected] >>>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>>> _______________________________________________ >>>>>> FastBit-users mailing list >>>>>> [email protected] >>>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>>> _______________________________________________ >>>>> FastBit-users mailing list >>>>> [email protected] >>>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>>> _______________________________________________ >>>> FastBit-users mailing list >>>> [email protected] >>>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >>> _______________________________________________ >>> FastBit-users mailing list >>> [email protected] >>> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users >> >> _______________________________________________ >> FastBit-users mailing list >> [email protected] >> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users > _______________________________________________ > FastBit-users mailing list > [email protected] > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users _______________________________________________ FastBit-users mailing list [email protected] https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
