Re: [PR] #466 - Handle text/plain content in JSoupParserBolt [stormcrawler]

via GitHub Tue, 16 Jun 2026 02:24:42 -0700


rzo1 commented on code in PR #1943:
URL: https://github.com/apache/stormcrawler/pull/1943#discussion_r3419547766



##########
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java:
##########
@@ -272,79 +280,91 @@ public void execute(Tuple tuple) {
         try {
             String html = Charset.forName(charset).decode(ByteBuffer.wrap(content)).toString();
 
-            jsoupDoc = Parser.htmlParser().parseInput(html, url);
-
-            if (!robotsMetaSkip) {
-                // extracts the robots directives from the meta tags
-                Element robotelement = jsoupDoc.selectFirst("meta[name~=(?i)robots][content]");
-                if (robotelement != null) {
-                    robotsTags.extractMetaTags(robotelement.attr("content"));
-                }
-            }
-
-            // store a normalised representation in metadata
-            // so that the indexer is aware of it
-            robotsTags.normaliseToMetadata(metadata);
-
-            // do not extract the links if no follow has been set
-            // and we are in strict mode
-            if (robotsTags.isNoFollow() && robotsNoFollowStrict) {
+            if (isPlainText) {
+                // no markup to parse: the decoded content is the text itself and
+                // there are no outlinks. An empty shell document is kept so that
+                // the downstream redirection check and parse filters still work.
+                jsoupDoc = org.jsoup.nodes.Document.createShell(url);
                 slinks = new HashMap<>(0);
+                robotsTags.normaliseToMetadata(metadata);
+                text = html;

Review Comment:
   I went with Option A. Happy for a re-review.



