Re: [FastBit-users] Api to handle multi valued column in one row

K. John Wu Fri, 10 Aug 2012 14:35:56 -0700

Hi, Andrew,

I actually contemplated about allowing a set type with subtypes, but
that would require a lot of new code and I simply don't have the time
to do that right now.  Furthermore, dealing with an arbitrary number
of integers might not be more efficient than dealing with an arbitrary
number of bytes.


To index sets of arbitrary integers, we would need to map the integer
users use into something we can use as positions of bitmaps.  This
might not be any better than a dictionary that maps strings to integers.

Likely the biggest source of time consumption is from have to read the
dictionary in before doing anything useful.  This time can eliminated
by either answering many questions for one dictionary creation, or to
normalize the data table (i.e., producing the cross-product version of
the data).

Of course, to really address this issue, we should do a real timing
measurement..

John



On 8/10/12 1:56 PM, Olson, Andrew wrote:
> Hi John & Kishore,
> I think it would be useful to be able to have columns that are set of 
> (signed/unsigned integer types, float, double, category).  These could be 
> stored as strings, but indexed according to the member type.  You would need 
> to specify a subtype in the --part.txt along with the delimiter (defaults to 
> double and ',').  I haven't looked at the code, but I imagine it wouldn't be 
> too hard to construct the usual bitmap index from a list of lists rather than 
> single values.  Then, queries that access the bitmap index should work out of 
> the box.
> Andrew
> 
> 
> On Aug 10, 2012, at 4:29 PM, kishore g wrote:
> 
> Hi John,
> 
> I actually prefer the column as text with delimiter approach instead of 
> creating the cross product. I want to understand why would there be a 
> performance impact. Will it impact all types of queries or only few. My 
> understanding is if i need to only get the number of hits( based on where 
> clause) and if aggregation functions( sum, count(*)) are on single values 
> columns there should be no performance impact.
> 
> Please correct me if my understanding is correct.
> 
> Thanks,
> Kishore G
> 
> On Fri, Aug 10, 2012 at 1:20 PM, K. John Wu 
> <[email protected]<mailto:[email protected]>> wrote:
> Hi, Kishore,
> 
> I would love to see you try the new set-value feature, however, since
> you are more concerned about the query performance, I will suggest
> that you create the cross-product out of the set-values.
> 
> Turn
> 
> a:1,2,3  b:23,343 c:23,123 metric1:12 metric2:34
> 
> into 12 rows as follows
> 
> a:1  b:23 c:23 metric1:12 metric2:34
> a:2  b:23 c:23 metric1:12 metric2:34
> a:3  b:23 c:23 metric1:12 metric2:34
> a:1  b:343 c:23 metric1:12 metric2:34
> a:2  b:343 c:23 metric1:12 metric2:34
> a:3  b:343 c:23 metric1:12 metric2:34
> a:1  b:23 c:123 metric1:12 metric2:34
> a:2  b:23 c:123 metric1:12 metric2:34
> a:3  b:23 c:123 metric1:12 metric2:34
> a:1  b:343 c:123 metric1:12 metric2:34
> a:2  b:343 c:123 metric1:12 metric2:34
> a:3  b:343 c:123 metric1:12 metric2:34
> 
> In this form, all a, b, and c are simple integers. Queries on them
> should be pretty fast.
> 
> John
> 
> 
> 
> On 8/10/12 12:48 PM, kishore g wrote:
>> Thanks a lot. A follow up question on this is how will this impact
>> performance and if there is an overhead of maintaining additional
>> dictionary. Because we go through the pain of converting actual
>> strings into int to speed up performance and avoid dictionary over
>> head but converting them back into string does look like a good option
>>
>> thanks,
>> Kishore G
>>
>> On Fri, Aug 10, 2012 at 11:25 AM, K. John Wu 
>> <[email protected]<mailto:[email protected]>
>> <mailto:[email protected]<mailto:[email protected]>>> wrote:
>>
>>     Hi, Kishore,
>>
>>     Funny thing you should ask this.  We have just implemented something
>>     that might do it for you.  Here is how it might go.
>>
>>     The way I interpret the row of data you have is as follows,
>>
>>     The column named 'a' contains a set of integers: 1, 2, and 3.
>>     The column named 'b' contains a set of integers: 23, and 343.
>>     The column named 'c' contains a set of integers: 23, and 123.
>>
>>     Currently, the only way to support a column with set-values (i.e.,
>>     each value is a set of something, as oppose to a simple value) is to
>>     treat the set as a string.  We have to treat it as a string because it
>>     is the only type that support variable-length data.  If you tell
>>     FastBit columns 'a', 'b', and 'c' are ibis::TEXT and then tell ibis to
>>     build indexes with the following option
>>
>>     keywords delimiters=','
>>
>>     The above option tell ibis.cpp to build keyword indexes for the text
>>     data and sets the delimiter to be coma ','.
>>
>>     You need to modify your query expression a little bit, "a = 1" needs
>>     to be "a contains '1'".  Literally, this means the value '1' is in the
>>     set represented by 'a'.  The quote around 1 is necessary because it
>>     would otherwise be interpreted as an integer.
>>
>>     "Why do I have to go through this much trouble?" you might ask.  Well,
>>     the problem is that SQL does not support set-values.  In fact, it does
>>     not support any sort of composite values.  This is why you have hide
>>     set-values as arbitrary text.  If you don't want to go through the
>>     above trouble, you might consider make a cross products of all 12
>>     possible combinations that is expressed in the single row in your
>>     example.  Expressing the single as 12 rows might still give you faster
>>     operations because FastBit will be able to treat integers properly as
>>     integers instead of as strings.
>>
>>     Hope this helps.
>>
>>     John
>>
>>
>>
>>     PS: one way to tell ibis.cpp to build indexes for columns 'a', 'b',
>>     'c' with the above mentioned indexing option is to use the following
>>     command line option
>>
>>     -b "a:keywords delimiters=','" -b "b:keywords delimiters=','" -b
>>     "c:keywords delimiters=','"
>>
>>     or
>>
>>     -b "?:keywords delimiters=','"
>>
>>     (the above option assume that columns 'a', 'b', and 'c' are the only
>>     ones with a single-letter column names)
>>
>>
>>     On 8/10/12 8:50 AM, kishore g wrote:
>>     > Hi,
>>     >
>>     > Is there an example on how to set multiple values for a column
>>     for one
>>     > row.
>>     >
>>     > For example
>>     >
>>     > a:1,2,3  b:23,343 c:23,123 metric1:12 metric2:34
>>     >
>>     > The following queries should include this row in the result
>>     >
>>     > select from table where a=1
>>     > select from table where a=2
>>     > select from table where a=1 and b=23
>>     > select from table where a=1 or b=99 or c=89
>>     >
>>     > thanks,
>>     > Kishore G
>>     >
>>     >
>>     > _______________________________________________
>>     > FastBit-users mailing list
>>     > [email protected]<mailto:[email protected]> 
>> <mailto:[email protected]<mailto:[email protected]>>
>>     > https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
>>     >
>>
>>
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]<mailto:[email protected]>
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
> _______________________________________________
> FastBit-users mailing list
> [email protected]
> https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
> 
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users

Re: [FastBit-users] Api to handle multi valued column in one row

Reply via email to