This is an automated email from the ASF dual-hosted git repository. mergebot-role pushed a commit to branch mergebot in repository https://gitbox.apache.org/repos/asf/beam-site.git
commit f8762f6328bcb584e5fc4b4e11bc14e5b870d195
Author: Niels Basjes <[email protected]>
AuthorDate: Thu Apr 26 23:08:21 2018 +0200

    Moved the 3rd party extensions to a separate page
---
 src/_includes/section-menu/sdks.html      |   1 +
 src/documentation/sdks/java-extensions.md | 198 ------------------------------
 src/documentation/sdks/java-thirdparty.md | 100 +++++++++++++++
 src/documentation/sdks/java.md            |   2 +
 4 files changed, 103 insertions(+), 198 deletions(-)

diff --git a/src/_includes/section-menu/sdks.html b/src/_includes/section-menu/sdks.html
index faace4e..729258f 100644
--- a/src/_includes/section-menu/sdks.html
+++ b/src/_includes/section-menu/sdks.html
@@ -9,6 +9,7 @@
           alt="External link."></a>
       </li>
       <li><a href="{{ site.baseurl }}/documentation/sdks/java-extensions/">Java SDK extensions</a></li>
+      <li><a href="{{ site.baseurl }}/documentation/sdks/java-thirdparty/">Java 3rd party extensions</a></li>
       <li><a href="{{ site.baseurl }}/documentation/sdks/java/nexmark/">Nexmark benchmark suite</a></li>
     </ul>
   </li>
diff --git a/src/documentation/sdks/java-extensions.md b/src/documentation/sdks/java-extensions.md
index aeabc9f..7742345 100644
--- a/src/documentation/sdks/java-extensions.md
+++ b/src/documentation/sdks/java-extensions.md
@@ -58,201 +58,3 @@ PCollection<KV<String, Iterable<KV<String, Integer>>>> groupedAndSorted =
     grouped.apply(
         SortValues.<String, String, Integer>create(BufferedExternalSorter.options()));
 ```
-
-## Parsing HTTPD/NGINX access logs.
-
-The Apache HTTPD webserver creates logfiles that contain valuable information about the requests that have been done to
-the webserver. The format of these config files is a configuration option in the Apache HTTPD server so parsing this
-into useful data elements is normally very hard to do.
-
-To solve this problem in an easy way a library was created that works in combination with Apache Beam
-and is capable of doing this for both the Apache HTTPD and NGINX.
-
-The basic idea is that the logformat specification is the schema used to create the line.
-THis parser is simply initialized with this schema and the list of fields you want to extract.
-
-### Basic usage
-Full documentation can be found here [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser)
-
-First you put something like this in your pom.xml file:
-
-    <dependency>
-      <groupId>nl.basjes.parse.httpdlog</groupId>
-      <artifactId>httpdlog-parser</artifactId>
-      <version>5.0</version>
-    </dependency>
-
-Check [https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser) for the latest version.
-
-Assume we have a logformat variable that looks something like this:
-
-    String logformat = "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"";
-
-**Step 1: What CAN we get from this line?**
-
-To figure out what values we CAN get from this line we instantiate the parser with a dummy class
-that does not have ANY @Field annotations or setters. The "Object" class will do just fine for this purpose.
-
-    Parser<Object> dummyParser = new HttpdLoglineParser<Object>(Object.class, logformat);
-    List<String> possiblePaths = dummyParser.getPossiblePaths();
-    for (String path: possiblePaths) {
-        System.out.println(path);
-    }
-
-You will get a list that looks something like this:
-
-    IP:connection.client.host
-    NUMBER:connection.client.logname
-    STRING:connection.client.user
-    TIME.STAMP:request.receive.time
-    TIME.DAY:request.receive.time.day
-    TIME.MONTHNAME:request.receive.time.monthname
-    TIME.MONTH:request.receive.time.month
-    TIME.YEAR:request.receive.time.year
-    TIME.HOUR:request.receive.time.hour
-    TIME.MINUTE:request.receive.time.minute
-    TIME.SECOND:request.receive.time.second
-    TIME.MILLISECOND:request.receive.time.millisecond
-    TIME.ZONE:request.receive.time.timezone
-    HTTP.FIRSTLINE:request.firstline
-    HTTP.METHOD:request.firstline.method
-    HTTP.URI:request.firstline.uri
-    HTTP.QUERYSTRING:request.firstline.uri.query
-    STRING:request.firstline.uri.query.*
-    HTTP.PROTOCOL:request.firstline.protocol
-    HTTP.PROTOCOL.VERSION:request.firstline.protocol.version
-    STRING:request.status.last
-    BYTESCLF:response.body.bytes
-    HTTP.URI:request.referer
-    HTTP.QUERYSTRING:request.referer.query
-    STRING:request.referer.query.*
-    HTTP.USERAGENT:request.user-agent
-
-Now some of these lines contain a * .
-This is a wildcard that can be replaced with any 'name' if you need a specific value.
-You can also leave the '*' and get everything that is found in the actual log line.
-
-**Step 2 Create the receiving POJO**
-
-We need to create the receiving record class that is simply a POJO that does not need any interface or inheritance.
-In this class we create setters that will be called when the specified field has been found in the line.
-
-So we can now add to this class a setter that simply receives a single value as specified using the @Field annotation:
-
-    @Field("IP:connection.client.host")
-    public void setIP(final String value) {
-        ip = value;
-    }
-
-If we really want the name of the field we can also do this
-
-    @Field("STRING:request.firstline.uri.query.img")
-    public void setQueryImg(final String name, final String value) {
-        results.put(name, value);
-    }
-
-This latter form is very handy because this way we can obtain all values for a wildcard field
-
-    @Field("STRING:request.firstline.uri.query.*")
-    public void setQueryStringValues(final String name, final String value) {
-        results.put(name, value);
-    }
-
-Instead of using the annotations on the setters we can also simply tell the parser the name of th setter that must be
-called when an element is found.
-
-    parser.addParseTarget("setIP", "IP:connection.client.host");
-    parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
-    parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
-
-### Example
-
-Assuming we have a String (being the full log line) comming in and an instance of the WebEvent class comming out
-(where the WebEvent already the has the needed setters) the final code when using this in an Apache Beam project
-will end up looking something like this
-
-    PCollection<WebEvent> filledWebEvents = input
-      .apply("Extract Elements from logline",
-        ParDo.of(new DoFn<String, WebEvent>() {
-          private Parser<WebEvent> parser;
-
-          @Setup
-          public void setup() throws NoSuchMethodException {
-            parser = new HttpdLoglineParser<>(WebEvent.class, getLogFormat());
-            parser.addParseTarget("setIP", "IP:connection.client.host");
-            parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
-            parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
-          }
-
-          @ProcessElement
-          public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
-            c.output(parser.parse(c.element()));
-          }
-        })
-      );
-
-## Analyzing the Useragent string
-
-This is a java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.
-
-### Basic usage
-You can get the prebuilt UDF from maven central.
-If you use a maven based project simply add this dependency to your Apache Beam application.
-
-    <dependency>
-      <groupId>nl.basjes.parse.useragent</groupId>
-      <artifactId>yauaa-beam</artifactId>
-      <version>4.2</version>
-    </dependency>
-
-Check https://github.com/nielsbasjes/yauaa for the latest version.
-
-### Example
-Assume you have a PCollection with your records.
-In most cases I see (clickstream data) these records (in this example this class is called "WebEvent")
-contain the useragent string in a field and the parsed results must be added to these fields.
-
-Now you must do two things:
-
- 1) Determine the names of the fields you need. Simply call getAllPossibleFieldNamesSorted() to get the list of possible fieldnames you can ask for.
-
-    UserAgentAnalyzer.newBuilder().build()
-        .getAllPossibleFieldNamesSorted()
-        .forEach(field -> System.out.println(field));
-
-and you get something like this:
-
-    DeviceClass
-    DeviceName
-    DeviceBrand
-    DeviceCpu
-    DeviceCpuBits
-    DeviceFirmwareVersion
-    DeviceVersion
-    OperatingSystemClass
-    OperatingSystemName
-    OperatingSystemVersion
-    ...
-
- 2) Add an instance of the (abstract) UserAgentAnalysisDoFn function and implement the functions as shown in the example below. Use the YauaaField annotation to get the setter for the requested fields.
-
-Note that the name of the two setters is not important, the system looks at the annotation.
-
-    .apply("Extract Elements from Useragent",
-        ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
-            @Override
-            public String getUserAgentString(WebEvent record) {
-                return record.useragent;
-            }
-
-            @YauaaField("DeviceClass")
-            public void setDC(WebEvent record, String value) {
-                record.deviceClass = value;
-            }
-
-            @YauaaField("AgentNameVersion")
-            public void setANV(WebEvent record, String value) {
-                record.agentNameVersion = value;
-            }
-        }));
-
diff --git a/src/documentation/sdks/java-thirdparty.md b/src/documentation/sdks/java-thirdparty.md
new file mode 100644
index 0000000..af5f745
--- /dev/null
+++ b/src/documentation/sdks/java-thirdparty.md
@@ -0,0 +1,100 @@
+---
+layout: section
+title: "Beam 3rd Party Java Extensions"
+section_menu: section-menu/sdks.html
+permalink: /documentation/sdks/java-thirdparty/
+---
+# Apache Beam 3rd Party Java Extensions
+
+These are some of the 3rd party Java libaries that may be useful for specific applications.
+
+## Parsing HTTPD/NGINX access logs.
+
+### Summary
+The Apache HTTPD webserver creates logfiles that contain valuable information about the requests that have been done to
+the webserver. The format of these log files is a configuration option in the Apache HTTPD server so parsing this
+into useful data elements is normally very hard to do.
+
+To solve this problem in an easy way a library was created that works in combination with Apache Beam
+and is capable of doing this for both the Apache HTTPD and NGINX.
+
+The basic idea is that the logformat specification is the schema used to create the line.
+This parser is simply initialized with this schema and the list of fields you want to extract.
+
+### Project page
+[https://github.com/nielsbasjes/logparser](https://github.com/nielsbasjes/logparser)
+
+### License
+Apache License 2.0
+
+### Download
+    <dependency>
+      <groupId>nl.basjes.parse.httpdlog</groupId>
+      <artifactId>httpdlog-parser</artifactId>
+      <version>5.0</version>
+    </dependency>
+
+### Code example
+
+Assuming a WebEvent class that has a the setters setIP, setQueryImg and setQueryStringValues
+
+    PCollection<WebEvent> filledWebEvents = input
+      .apply("Extract Elements from logline",
+        ParDo.of(new DoFn<String, WebEvent>() {
+          private Parser<WebEvent> parser;
+
+          @Setup
+          public void setup() throws NoSuchMethodException {
+            parser = new HttpdLoglineParser<>(WebEvent.class,
+              "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Cookie}i\"");
+            parser.addParseTarget("setIP", "IP:connection.client.host");
+            parser.addParseTarget("setQueryImg", "STRING:request.firstline.uri.query.img");
+            parser.addParseTarget("setQueryStringValues", "STRING:request.firstline.uri.query.*");
+          }
+
+          @ProcessElement
+          public void processElement(ProcessContext c) throws InvalidDissectorException, MissingDissectorsException, DissectionFailure {
+            c.output(parser.parse(c.element()));
+          }
+        })
+      );
+
+
+## Analyzing the Useragent string
+
+### Summary
+Parse and analyze the useragent string and extract as many relevant attributes as possible.
+
+### Project page
+[https://github.com/nielsbasjes/yauaa](https://github.com/nielsbasjes/yauaa)
+
+### License
+Apache License 2.0
+
+### Download
+    <dependency>
+      <groupId>nl.basjes.parse.useragent</groupId>
+      <artifactId>yauaa-beam</artifactId>
+      <version>4.2</version>
+    </dependency>
+
+### Code example
+    PCollection<WebEvent> filledWebEvents = input
+        .apply("Extract Elements from Useragent",
+            ParDo.of(new UserAgentAnalysisDoFn<WebEvent>() {
+                @Override
+                public String getUserAgentString(WebEvent record) {
+                    return record.useragent;
+                }
+
+                @YauaaField("DeviceClass")
+                public void setDC(WebEvent record, String value) {
+                    record.deviceClass = value;
+                }
+
+                @YauaaField("AgentNameVersion")
+                public void setANV(WebEvent record, String value) {
+                    record.agentNameVersion = value;
+                }
+            }));
+
diff --git a/src/documentation/sdks/java.md b/src/documentation/sdks/java.md
index 826929e..f5be0fd 100644
--- a/src/documentation/sdks/java.md
+++ b/src/documentation/sdks/java.md
@@ -33,3 +33,5 @@ The Java SDK has the following extensions:
 - [join-library]({{site.baseurl}}/documentation/sdks/java-extensions/#join-library) provides inner join, outer left join, and outer right join functions.
 - [sorter]({{site.baseurl}}/documentation/sdks/java-extensions/#sorter) is an efficient and scalable sorter for large iterables.
 - [Nexmark]({{site.baseurl}}/documentation/sdks/java/nexmark) is a benchmark suite that runs in batch and streaming modes.
+
+In addition several [3rd party Java libraries]({{site.baseurl}}/documentation/sdks/java-thirdparty/) exist.
\ No newline at end of file

-- 
To stop receiving notification emails like this one, please contact
[email protected].
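
Reviewer note on the two "Code example" sections in the diff: both assume a `WebEvent` record class that the commit never shows. A minimal sketch of what such a class might look like follows. The setter names and the `useragent`, `deviceClass` and `agentNameVersion` fields are taken from the diff. The private `ip` and `results` fields mirror the removed setter snippets, while the getters and the `Serializable` marker are assumptions added here for illustration (the marker on the guess that Beam will need some way to encode the elements).

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical record class: the commit names the setters but not the class body.
// In a real project this would normally be a public class in its own file.
class WebEvent implements Serializable {
    private String ip;
    private final Map<String, String> results = new HashMap<>();

    // Public fields read and written directly by the Useragent example.
    public String useragent;
    public String deviceClass;
    public String agentNameVersion;

    // Target of parser.addParseTarget("setIP", "IP:connection.client.host").
    public void setIP(final String value) {
        ip = value;
    }

    // Two-argument form: receives the field name as well as the value.
    public void setQueryImg(final String name, final String value) {
        results.put(name, value);
    }

    // Wildcard target: invoked once for every query-string parameter found.
    public void setQueryStringValues(final String name, final String value) {
        results.put(name, value);
    }

    // Accessors are an addition for this sketch, not part of the diff.
    public String getIp() {
        return ip;
    }

    public Map<String, String> getResults() {
        return results;
    }
}
```

Because the setters are registered by name through `addParseTarget`, this sketch needs no `@Field` annotations and therefore no compile-time dependency on the parser library.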
