Hi Charles,
Two comments.
First, Drill "maps" are actually structs (nested tuples): every record must
have the same set of columns within the "map." That is, though the Drill type
is called a "map", and you might assume that, given that name, it would act
like a JSON, Python, or Java map, the actual implementation is, in fact, a
struct. (I saw a JIRA ticket to rename the Map type in some context because of
this unfortunate mismatch of name and implementation.)
By contrast, Hive defines both Map and Struct types. A Drill "Map" is like a
Hive Struct, and Drill has no equivalent of a Hive Map. Still, there are
solutions.
To use a single parsed_packet map column, you'd have to know the union of all
the columns you'll create across all the packet types and define a map schema
that includes all these columns. Define this map in all batches so you have a
consistent schema. This means including all columns for all packet types, even
if the data does not happen to have all packet types.
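
A minimal sketch of the "union of all columns" idea, in plain Java with no
Drill classes. The packet types and column names (query_name, icmp_type, and so
on) are invented for illustration; the real set depends on what the parsing
library returns. Every record carries every column of the union, and columns
that do not apply to a given packet are simply null.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; this does not use any Drill API.
class UnionMapSchema {

  // Hypothetical per-packet-type column lists, invented for this example.
  static final Map<String, List<String>> COLUMNS_BY_TYPE = new LinkedHashMap<>();
  static {
    COLUMNS_BY_TYPE.put("dns", List.of("query_name", "query_type", "ttl"));
    COLUMNS_BY_TYPE.put("icmp", List.of("icmp_type", "icmp_code", "ttl"));
  }

  // The struct schema: the union of every column any packet type can produce,
  // in first-seen order, with shared names (here "ttl") appearing once.
  static List<String> unionSchema() {
    Set<String> columns = new LinkedHashSet<>();
    for (List<String> typeColumns : COLUMNS_BY_TYPE.values()) {
      columns.addAll(typeColumns);
    }
    return new ArrayList<>(columns);
  }

  // Build one record with the full schema; columns the packet lacks are null.
  static Map<String, Object> toRecord(Map<String, Object> parsedFields) {
    Map<String, Object> record = new LinkedHashMap<>();
    for (String column : unionSchema()) {
      record.put(column, parsedFields.get(column));
    }
    return record;
  }

  public static void main(String[] args) {
    System.out.println(unionSchema());
    // An ICMP packet still gets the DNS columns, set to null.
    System.out.println(toRecord(Map.of("icmp_type", 8, "icmp_code", 0, "ttl", 64)));
  }
}
```

Note that a column shared by two packet types, such as "ttl" above, would need
the same data type in both for a single struct schema to work.
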
Or, you could define a different map for each packet type; but you'd still have
to define the needed ones up front. You could do this if you had columns
called, say, parsed_x_packet, parsed_y_packet, etc. If that packet type is
projected (appears in the SELECT ... clause), then define the required schema
for all records. The user just selects the packet types of interest.
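
The per-packet-type alternative might look roughly like this, again in plain
Java rather than Drill code. The map names (parsed_dns_packet,
parsed_icmp_packet) and their columns are hypothetical, and the projected set
stands in for whatever column list the query planner hands the reader. Only the
maps the query asks for are defined, but each defined map has a fixed schema
for all records.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; this does not use any Drill API.
class PerTypeMaps {

  // Hypothetical map columns, one per packet type, each with a fixed schema.
  static final Map<String, List<String>> MAP_COLUMNS = new LinkedHashMap<>();
  static {
    MAP_COLUMNS.put("parsed_dns_packet", List.of("query_name", "query_type"));
    MAP_COLUMNS.put("parsed_icmp_packet", List.of("icmp_type", "icmp_code"));
  }

  // Keep only the maps named in the SELECT list; unknown names are ignored.
  static Map<String, List<String>> schemaFor(Set<String> projected) {
    Map<String, List<String>> schema = new LinkedHashMap<>();
    for (Map.Entry<String, List<String>> entry : MAP_COLUMNS.entrySet()) {
      if (projected.contains(entry.getKey())) {
        schema.put(entry.getKey(), entry.getValue());
      }
    }
    return schema;
  }

  public static void main(String[] args) {
    // e.g. SELECT parsed_dns_packet FROM ...
    System.out.println(schemaFor(Set.of("parsed_dns_packet")));
  }
}
```
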
This brings us to the second comment. The long work to merge the row set
framework into Drill is coming to a close, and it is now available for you to
use. The row set framework provides a very simple way to define your map
schemas (once you know what they are). It also handles projection: the user
selects some of your parsed packets, but not others, or projects some of the
packet map columns, but not others.
Drill 1.16 migrates the CSV reader to the new framework (where it also supports
user-defined schemas and type conversions). The next step in the row set work
is to migrate a few other readers to the new framework. Perhaps PCAP might be
a good candidate to enable your new packet-parsing feature.
Thanks,
- Paul
On Tuesday, April 23, 2019, 9:34:16 AM PDT, Charles Givre
<[email protected]> wrote:
Hello all,
I saw a few open source libraries that parse actual packet content and was
interested in incorporating this into Drill's PCAP parser. I was thinking
initially of writing this as a UDF; however, I think it would be much better to
include this directly in Drill. What I was thinking was to create a field
called parsed_packet that would be a Drill Map. The contents of this field
would vary depending on the type of packet. For instance, a DNS packet would
yield all the DNS info, an ICMP packet the ICMP info, and so on.
Does the community think this is a good idea? Also, given the structure of the
PCAP plugin, I'm not quite sure how to create a Map field with variable
contents. Are there any examples that use the same architecture as the PCAP
plugin?
Thanks,
-- C