rzo1 commented on code in PR #1943:
URL: https://github.com/apache/stormcrawler/pull/1943#discussion_r3415389865


##########
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java:
##########
@@ -272,79 +280,91 @@ public void execute(Tuple tuple) {
         try {
             String html = 
Charset.forName(charset).decode(ByteBuffer.wrap(content)).toString();
 
-            jsoupDoc = Parser.htmlParser().parseInput(html, url);
-
-            if (!robotsMetaSkip) {
-                // extracts the robots directives from the meta tags
-                Element robotelement = 
jsoupDoc.selectFirst("meta[name~=(?i)robots][content]");
-                if (robotelement != null) {
-                    robotsTags.extractMetaTags(robotelement.attr("content"));
-                }
-            }
-
-            // store a normalised representation in metadata
-            // so that the indexer is aware of it
-            robotsTags.normaliseToMetadata(metadata);
-
-            // do not extract the links if no follow has been set
-            // and we are in strict mode
-            if (robotsTags.isNoFollow() && robotsNoFollowStrict) {
+            if (isPlainText) {
+                // no markup to parse: the decoded content is the text itself 
and
+                // there are no outlinks. An empty shell document is kept so 
that
+                // the downstream redirection check and parse filters still 
work.
+                jsoupDoc = org.jsoup.nodes.Document.createShell(url);
                 slinks = new HashMap<>(0);
+                robotsTags.normaliseToMetadata(metadata);
+                text = html;

Review Comment:
   Good catch. The plain-text branch does `text = html` directly, so neither 
`textextractor.no.text` nor `textextractor.skip.after` applies, and with the 
default `http.content.limit: -1` a large `.txt` is stored in full. I see two 
ways to bound it. Both reuse the existing `TextExtractor` config keys so there 
is no new knob to document.
   
   ### Option A: truncate the raw decoded string
   
   Read the two limits in `prepare()` and apply them in the branch:
   
   ```java
   // fields
   private boolean plainTextNoText;
   private int plainTextMaxSize;
   
   // prepare()
   plainTextNoText  = ConfUtils.getBoolean(conf, 
TextExtractor.NO_TEXT_PARAM_NAME, false);
   plainTextMaxSize = ConfUtils.getInt(conf, 
TextExtractor.TEXT_MAX_TEXT_PARAM_NAME, -1);
   
   // in the isPlainText branch
   if (plainTextNoText) {
       text = "";
   } else if (plainTextMaxSize > 0 && html.length() > plainTextMaxSize) {
       text = html.substring(0, plainTextMaxSize);
   } else {
       text = html;
   }
   ```
   
   **Pros:** preserves the original plain-text formatting (newlines, spacing, 
indentation), which is usually the point of a `.txt`; trivially cheap.
   **Cons:** re-reads config keys that conceptually belong to the 
`TextExtractor`, so a custom `textextractor.class` with different semantics 
would not be consulted; the cap is on character count, not on the extractor's 
accumulation logic.
   
   ### Option B: route the text through the configured extractor
   
   Wrap the decoded content in the shell document and let the extractor produce 
the text:
   
   ```java
   jsoupDoc = Document.createShell(url);
   jsoupDoc.body().appendChild(new TextNode(html));
   slinks = new HashMap<>(0);
   robotsTags.normaliseToMetadata(metadata);
   text = textExtractor.text(jsoupDoc.body());
   ```
   
   **Pros:** a single text-extraction path for HTML and plain text; 
automatically honors `no.text`, `skip.after`, exclude tags, and any custom 
`textextractor.class`.
   **Cons:** the extractor normalizes whitespace (`appendNormalisedText`), 
collapsing newlines and runs of spaces, so the stored plain text loses its 
original layout. Fine for indexing, lossy if the raw structure matters.
   
   ### Trade-off in one line
   
   A keeps plain text faithful but special-cases the limits; B unifies the code 
path but mangles whitespace. Either way I would add a test asserting a 
`>skip.after` input is truncated.
   
   WDYT?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to