rzo1 commented on code in PR #1943:
URL: https://github.com/apache/stormcrawler/pull/1943#discussion_r3415389865
##########
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java:
##########
@@ -272,79 +280,91 @@ public void execute(Tuple tuple) {
try {
String html =
Charset.forName(charset).decode(ByteBuffer.wrap(content)).toString();
- jsoupDoc = Parser.htmlParser().parseInput(html, url);
-
- if (!robotsMetaSkip) {
- // extracts the robots directives from the meta tags
- Element robotelement =
jsoupDoc.selectFirst("meta[name~=(?i)robots][content]");
- if (robotelement != null) {
- robotsTags.extractMetaTags(robotelement.attr("content"));
- }
- }
-
- // store a normalised representation in metadata
- // so that the indexer is aware of it
- robotsTags.normaliseToMetadata(metadata);
-
- // do not extract the links if no follow has been set
- // and we are in strict mode
- if (robotsTags.isNoFollow() && robotsNoFollowStrict) {
+ if (isPlainText) {
+ // no markup to parse: the decoded content is the text itself
and
+ // there are no outlinks. An empty shell document is kept so
that
+ // the downstream redirection check and parse filters still
work.
+ jsoupDoc = org.jsoup.nodes.Document.createShell(url);
slinks = new HashMap<>(0);
+ robotsTags.normaliseToMetadata(metadata);
+ text = html;
Review Comment:
Good catch. The plain-text branch does `text = html` directly, so neither
`textextractor.no.text` nor `textextractor.skip.after` applies, and with the
default `http.content.limit: -1` a large `.txt` is stored in full. I see two
ways to bound it. Both reuse the existing `TextExtractor` config keys so there
is no new knob to document.
### Option A: truncate the raw decoded string
Read the two limits in `prepare()` and apply them in the branch:
```java
// fields
private boolean plainTextNoText;
private int plainTextMaxSize;
// prepare()
plainTextNoText = ConfUtils.getBoolean(conf,
TextExtractor.NO_TEXT_PARAM_NAME, false);
plainTextMaxSize = ConfUtils.getInt(conf,
TextExtractor.TEXT_MAX_TEXT_PARAM_NAME, -1);
// in the isPlainText branch
if (plainTextNoText) {
text = "";
} else if (plainTextMaxSize > 0 && html.length() > plainTextMaxSize) {
text = html.substring(0, plainTextMaxSize);
} else {
text = html;
}
```
**Pros:** preserves the original plain-text formatting (newlines, spacing,
indentation), which is usually the point of a `.txt`; trivially cheap.
**Cons:** re-reads config keys that conceptually belong to the
`TextExtractor`, so a custom `textextractor.class` with different semantics
would not be consulted; the cap is on character count, not on the extractor's
accumulation logic.
### Option B: route the text through the configured extractor
Wrap the decoded content in the shell document and let the extractor produce
the text:
```java
jsoupDoc = Document.createShell(url);
jsoupDoc.body().appendChild(new TextNode(html));
slinks = new HashMap<>(0);
robotsTags.normaliseToMetadata(metadata);
text = textExtractor.text(jsoupDoc.body());
```
**Pros:** a single text-extraction path for HTML and plain text;
automatically honors `no.text`, `skip.after`, exclude tags, and any custom
`textextractor.class`.
**Cons:** the extractor normalizes whitespace (`appendNormalisedText`),
collapsing newlines and runs of spaces, so the stored plain text loses its
original layout. Fine for indexing, lossy if the raw structure matters.
### Trade-off in one line
A keeps plain text faithful but special-cases the limits; B unifies the code
path but mangles whitespace. Either way I would add a test asserting a
`>skip.after` input is truncated.
WDYT?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]