[ https://issues.apache.org/jira/browse/NIFI-1156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15094290#comment-15094290 ]

ASF GitHub Bot commented on NIFI-1156:
--------------------------------------

Github user markap14 commented on a diff in the pull request:

    https://github.com/apache/nifi/pull/124#discussion_r49481551
  
    --- Diff: nifi-nar-bundles/nifi-html-bundle/nifi-html-processors/src/test/java/org/apache/nifi/TestGetHTMLElement.java ---
    @@ -0,0 +1,319 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nifi;
    +
    +import org.apache.commons.lang3.StringUtils;
    +import org.apache.nifi.flowfile.FlowFile;
    +import org.apache.nifi.processor.ProcessSession;
    +import org.apache.nifi.util.MockFlowFile;
    +import org.apache.nifi.util.TestRunner;
    +import org.apache.nifi.util.TestRunners;
    +import org.junit.Before;
    +import org.junit.Test;
    +
    +import java.lang.Exception;
    +import java.util.List;
    +
    +import static org.junit.Assert.assertTrue;
    +
    +public class TestGetHTMLElement extends AbstractHTMLTest {
    +
    +    private TestRunner testRunner;
    +
    +    @Before
    +    public void init() {
    +        testRunner = TestRunners.newTestRunner(GetHTMLElement.class);
    +        testRunner.setProperty(GetHTMLElement.URL, "http://localhost");
    +        testRunner.setProperty(GetHTMLElement.OUTPUT_TYPE, GetHTMLElement.ELEMENT_HTML);
    +        testRunner.setProperty(GetHTMLElement.DESTINATION, GetHTMLElement.DESTINATION_CONTENT);
    +        testRunner.setProperty(GetHTMLElement.HTML_CHARSET, "UTF-8");
    +    }
    +
    +    @Test
    +    public void testNoElementFound() throws Exception {
    +        testRunner.setProperty(GetHTMLElement.CSS_SELECTOR, "b");   //Bold element is not present in sample HTML
    +//        testRunner.setProperty(GetHTMLElement.APPEND_ELEMENT_VALUE, "");
    +
    +        ProcessSession session = testRunner.getProcessSessionFactory().createSession();
    +        FlowFile ff = writeContentToNewFlowFile(HTML.getBytes(), session);
    +
    +        testRunner.enqueue(ff);
    +        testRunner.run();
    +
    +        testRunner.assertTransferCount(GetHTMLElement.REL_SUCCESS, 0);
    +        testRunner.assertTransferCount(GetHTMLElement.REL_FAILURE, 0);
    +        testRunner.assertTransferCount(GetHTMLElement.REL_NOT_FOUND, 1);
    +    }
    +
    +    @Test
    +    public void testInvalidSelector() throws Exception {
    +        testRunner.setProperty(GetHTMLElement.CSS_SELECTOR, "InvalidCSSSelectorSyntax");
    +
    +        ProcessSession session = testRunner.getProcessSessionFactory().createSession();
    +        FlowFile ff = writeContentToNewFlowFile(HTML.getBytes(), session);
    +
    +        testRunner.enqueue(ff);
    --- End diff --
    
    This would all be a lot easier if we just called
    testRunner.enqueue(HTML.getBytes()). There is no need to create a session
    and write the content to a FlowFile by hand.
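
A minimal sketch of the simplification suggested above, applied to testNoElementFound from the diff. It assumes that AbstractHTMLTest (part of this pull request, not shown in the diff) supplies the HTML constant, that GetHTMLElement lives in the same package, and that nifi-mock and JUnit 4 are on the test classpath. The class name TestGetHTMLElementEnqueueSketch is hypothetical and used only for illustration.

```java
package org.apache.nifi;

import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Before;
import org.junit.Test;

// Hypothetical class name; AbstractHTMLTest and GetHTMLElement come from the pull request.
public class TestGetHTMLElementEnqueueSketch extends AbstractHTMLTest {

    private TestRunner testRunner;

    @Before
    public void init() {
        // Same processor configuration as in the diff under review.
        testRunner = TestRunners.newTestRunner(GetHTMLElement.class);
        testRunner.setProperty(GetHTMLElement.URL, "http://localhost");
        testRunner.setProperty(GetHTMLElement.OUTPUT_TYPE, GetHTMLElement.ELEMENT_HTML);
        testRunner.setProperty(GetHTMLElement.DESTINATION, GetHTMLElement.DESTINATION_CONTENT);
        testRunner.setProperty(GetHTMLElement.HTML_CHARSET, "UTF-8");
    }

    @Test
    public void testNoElementFound() {
        // The bold element is not present in the sample HTML.
        testRunner.setProperty(GetHTMLElement.CSS_SELECTOR, "b");

        // TestRunner.enqueue(byte[]) wraps the bytes in a FlowFile itself,
        // so no ProcessSession or manually written FlowFile is needed.
        testRunner.enqueue(HTML.getBytes());
        testRunner.run();

        // Same assertions as the original test.
        testRunner.assertTransferCount(GetHTMLElement.REL_SUCCESS, 0);
        testRunner.assertTransferCount(GetHTMLElement.REL_FAILURE, 0);
        testRunner.assertTransferCount(GetHTMLElement.REL_NOT_FOUND, 1);
    }
}
```

With this form the ProcessSession and FlowFile imports are no longer used in the test and could probably be dropped as well.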


> HTML Parsing Processors Bundle
> ------------------------------
>
>                 Key: NIFI-1156
>                 URL: https://issues.apache.org/jira/browse/NIFI-1156
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Jeremy Dyer
>            Priority: Minor
>
> NiFi provides the ability to ingest HTML but lacks a convenient way to 
> interact with that HTML once it has entered the flow. There should be an HTML 
> Processing Bundle that provides mechanisms for manipulating and interacting 
> with HTML data once it has entered the flow. Jsoup (http://jsoup.org/) seems 
> like a logical tool to use since it is mature and has an MIT license, which 
> would allow it to be incorporated into NiFi.
> “GetHTMLElement” should use the CSS selector-syntax 
> (http://www.w3schools.com/cssref/css_selectors.asp) built into Jsoup to 
> extract 0-N HTML elements from the original HTML input. This processor should 
> support a delimited string of selectors allowing the user to build compound 
> HTML element output. Each HTML element (or compound element result) extracted 
> will create a new Flowfile where the element will be in either the Flowfile 
> content or an attribute depending on the user configuration.
> “ModifyHTMLElement” should provide the ability to modify the original input 
> HTML and overwrite any existing element values. The HTML element that will be 
> modified can be selected by using the CSS selector-syntax.
> “PutHTMLElement” should provide the ability to put a new HTML element 
> anywhere in the original input HTML using CSS selector-syntax to indicate the 
> position that the new HTML element should be placed.
> There seems to be potential for adding more processors, but this seems like 
> a good start. Since there is a dependency on Jsoup and potential for more 
> processors to come, I think it makes sense to add this logic as its own NAR 
> bundle, but I could be wrong.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
