[GitHub] [incubator-druid] nosahama commented on issue #2523: Support multiple lookups within one namespace

GitBox Wed, 08 May 2019 15:14:31 -0700

nosahama commented on issue #2523: Support multiple lookups within one namespace
URL: https://github.com/apache/incubator-druid/issues/2523#issuecomment-490667638
 
 
   > As requested I am sharing our use case. We're using a TSV in S3 for a 
namespace lookup (at least to start with, we will probably switch over to a 
JDBC source eventually). We have a single key column, which always corresponds 
to the same actual dimension in Druid. We have a dozen lookup columns (could 
grow by a handful, but I'd think no more than 20). And we're starting pretty 
small now with only about 100K rows, but expect that could grow to several 
million rows before too long.
   > 
   > We don't need this updated really frequently. Actually we're still working 
out our ETLs and so forth to deal with revisions and additions to the lookup 
data. But I wouldn't expect us to have updates more frequently than hourly, and 
probably more like daily.
   > 
   > As far as pain points with this arrangement go, there is certainly plenty of 
boilerplate in the config. I have an array of a dozen entries in 
`druid.query.extraction.namespace.lookups` that are identical in all fields 
except for `namespace` and `valueColumn`. A bit clunky but not so much that I'd 
complain about it really - I did write a couple of simple scripts that generate 
the stuff to be placed in config.
   > 
   > I'm more concerned about the overhead when we do update the lookup source. 
Druid will have to load and parse this TSV (potentially 20 x 3M in size) once per 
lookup. I haven't done any benchmarking but I have noticed that it can take on 
the order of 15 seconds to completely load our current 12 x 100K case. Even if 
it takes a few minutes that is not a gamebreaker (assuming it does not 
interfere with query performance or produce inconsistent results while in 
progress). But it certainly seems like it could be a lot more efficient to load 
and parse the file once instead of 12 or 20 times.
   > 
   > Overall, the configuration and use feels a bit clunky, I think because 
from the user point of view, we have just one "lookup namespace" - there is a 
single source, and a single key column. It would feel more natural to define 
the data source level properties (uri, format, columns) and key column once, 
along with a list of allowed targetColumns, then use it in dimension specs and 
filters by referencing just the one single namespace plus a targetColumn. It 
might start to look like an ingestion spec at that point, with dataSchema- and 
ioConfig-like sections.
   > 
   > But honestly I don't know how much of a priority I'd want it to be. 
Associating a single namespace with a single key column and multiple value 
columns might well be overfitting to our specific case, and it's certainly 
quite usable as it stands.
   > 
   > (One side note, the ability to include columns in the CSV which are not 
key or value columns is useful for assembling the data manually - we can 
include "friendly name" sort of columns that are helpful to people who are 
filling in or auditing the actual lookup data.)
   
   Hi there, I am trying to configure Druid to load a lookup file from 
S3. How do I do this? Do I use the `file:/` syntax, or is there another syntax 
for loading lookups from S3?
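
   For reference, the quoted comment describes a dozen entries in 
`druid.query.extraction.namespace.lookups` that are identical except for 
`namespace` and `valueColumn`, generated by simple scripts. A minimal sketch 
of such a generator might look like the following. Everything beyond 
`namespace` and `valueColumn` is an assumption made for illustration: the 
field names `type`, `uri`, `namespaceParseSpec`, `format`, `columns`, 
`keyColumn`, and `pollPeriod`, the helper name `build_lookup_entries`, the 
bucket and file path, and the column names are all hypothetical, and whether 
an `s3://` URI is accepted depends on the Druid version and loaded extensions.

```python
import json


def build_lookup_entries(uri, columns, key_column, value_columns, poll_period="PT1H"):
    """Return one lookup entry per value column.

    Only `namespace` and `valueColumn` differ between entries, mirroring the
    boilerplate described in the quoted comment. The remaining field names
    are assumptions, not a confirmed Druid schema.
    """
    entries = []
    for value_column in value_columns:
        entries.append({
            "type": "uri",                  # assumed lookup type name
            "namespace": value_column,      # one namespace per value column
            "uri": uri,                     # same source file for every entry
            "namespaceParseSpec": {
                "format": "tsv",
                "columns": columns,
                "keyColumn": key_column,
                "valueColumn": value_column,
            },
            "pollPeriod": poll_period,      # assumed ISO-8601 period string
        })
    return entries


if __name__ == "__main__":
    # Hypothetical bucket, path, and column names.
    columns = ["id", "friendly_name", "region", "category"]
    entries = build_lookup_entries(
        "s3://example-bucket/lookups/dims.tsv",
        columns,
        key_column="id",
        value_columns=["region", "category"],
    )
    print(json.dumps(entries, indent=2))
```

   Columns such as `friendly_name` can stay in `columns` without appearing in 
`value_columns`, which matches the side note in the quoted comment about 
carrying extra columns for people auditing the data.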


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

