cgivre commented on pull request #2112:
URL: https://github.com/apache/drill/pull/2112#issuecomment-731204034


   > I did some testing and found something worth discussing regarding the 
wildcards.
   > 
   > _Note about all of these points; I'm fine with just putting a bit of 
documentation in place that describes these as known limitations._
   > 
   > When I do a "select *" from a table backed by this format and I print the 
result set I get for "wildcard" scenarios like the query parameters and the 
cookies options like these:
   > 
   > ```
   > `response_cookies_$` STRUCT<`apache` VARCHAR>,
   > `request_firstline_uri_query_$` STRUCT<`aap` VARCHAR, `res` VARCHAR>,
   > ```
   
   That is the intended behavior. Drill creates one map of the parsed cookies 
and one map of the parsed URI query parameters. If you don't think this is the 
most effective approach, I'm definitely open to refactoring it.
   
   Just as an FYI, I only chose to do it this way because that's how it was 
done in the original Drill/HTTPD integration.   It might be better to flatten 
these maps and produce actual columns with the values. 
   
   > 
   > The first thing I noticed is that the actual values in the data are 
reflected in the header. I assume this is just the way the RowSet::print() 
works. Do note that if you have a large variety of query parameters in your 
dataset this may become a big list.
   > 
   
   That is correct.
   
   > What I find is that these wildcards do not work as I expected when 
comparing what the underlying parser does.
   > 
   > Assuming the URI `/icons/powered_by_rh.png?aap=noot&res=1024x768`
   > 
   > When I ask for `request_firstline_uri_query_$` I see in the output 
something that looks like what I expect `{"noot", "1024x768"}`
   > However when I directly try to query a deeper entry like 
`request_firstline_uri_query_aap` I consistently see a `null` value.
   > 
   > This "explicit" way of asking for a value is there because the system 
then does not need to URL-decode the "unwanted" fields (i.e. there is a bit of 
performance impact if there are a lot of unwanted fields, such as query 
parameters or cookies, in the line at hand).
   > 
   
   The way Drill works is that it creates a vector for every column it finds. 
So if you have a URL with params `field1` and `field2`, you'll get vectors 
named `field1` and `field2` (regardless of whether they are in a map or not). 
   
   Now, if the next record has `field2` and `field3`, then `field1` will be 
`null` for the second row, while `field2` and `field3` will be populated. 
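
As a rough illustration of the null-filling described above, here is a minimal 
Python sketch. It is not Drill code: `to_columns` is a hypothetical helper that 
models each vector as a plain list, and it assumes that a column first seen in 
a later row is back-filled with `None` for the earlier rows.

```python
from typing import Dict, List, Optional


def to_columns(rows: List[Dict[str, str]]) -> Dict[str, List[Optional[str]]]:
    """Build one column per field name seen in any row, padding gaps with None."""
    columns: Dict[str, List[Optional[str]]] = {}
    for index, row in enumerate(rows):
        for name in row:
            # A field seen for the first time is back-filled with None
            # for every earlier row (an assumption of this sketch).
            columns.setdefault(name, [None] * index)
        for name, values in columns.items():
            # Fields missing from the current row become None.
            values.append(row.get(name))
    return columns


rows = [
    {"field1": "a", "field2": "b"},  # first record
    {"field2": "c", "field3": "d"},  # second record: no field1
]
columns = to_columns(rows)
```

In this model `columns["field1"]` holds a value for the first record and `None` 
for the second, which mirrors the `null` in the example above.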
   
   > Note that the underlying parser does support this; the example for Apache 
Pig makes this the most clear:
   > 
https://github.com/nielsbasjes/logparser/blob/master/examples/apache-pig/src/main/pig/demo.pig#L34
   > 
   > Now the response cookies are special because they have limited support for 
a wildcard in the middle:
   > 
   > ```
   > `response_cookies_$_comment` VARCHAR,
   > `response_cookies_$_domain` VARCHAR,
   > `response_cookies_$_expires` TIMESTAMP,
   > `response_cookies_$_path` VARCHAR,
   > `response_cookies_$_value` VARCHAR,
   > ```
   > 
   > See 
https://github.com/nielsbasjes/logparser/blob/master/httpdlog/httpdlog-parser/src/test/java/nl/basjes/parse/httpdlog/ApacheHttpdLogParserTest.java#L161
   > 
   > These are intended so you can ask for something 
like `STRING:response.cookies.jsessionid.path`
   > 
   > Here I found that these seem to always return a null also.
   
   What I think you're getting at is that it might be advantageous to flatten 
the wildcard fields into top-level columns rather than put them in a Drill map, 
even though doing so creates many null columns. Is that correct? If so, I think 
the best way to go about that would be to add a config option called 
`flattenWildcardFields`. If the user enables it, you would get a column for 
every entry in the wildcard fields rather than a map. 
   
   The advantage that I see in doing this is easier queries. For instance if 
you wanted to find particular values from a query string, you could do 
something like:
   
   ```sql
   SELECT <fields>
   FROM ...
   WHERE request_firstline_uri_query_aap = 'noot'
   ```
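
To make the two shapes concrete, here is a small Python sketch of the 
difference between the current map output and the proposed flattened output. 
It is only a model of the idea, not the plugin's implementation: 
`query_columns` and its `flatten` argument are hypothetical stand-ins for the 
proposed `flattenWildcardFields` option, and the URI is the one from the 
example earlier in this thread.

```python
from urllib.parse import parse_qsl, urlsplit

# Column-name prefix taken from the schema shown earlier in the thread.
PREFIX = "request_firstline_uri_query_"


def query_columns(uri: str, flatten: bool) -> dict:
    """Return the query parameters as one map column or as flat columns."""
    params = dict(parse_qsl(urlsplit(uri).query))
    if flatten:
        # One top-level column per query parameter.
        return {PREFIX + name: value for name, value in params.items()}
    # A single wildcard column holding a map of all parameters.
    return {PREFIX + "$": params}


uri = "/icons/powered_by_rh.png?aap=noot&res=1024x768"
as_map = query_columns(uri, flatten=False)
as_flat = query_columns(uri, flatten=True)
```

With `flatten=True` each parameter gets its own key, such as 
`request_firstline_uri_query_aap`, which is what would let a `WHERE` clause 
reference it directly.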
   
   Would that work for you?
   

