dpol1 commented on code in PR #1943:
URL: https://github.com/apache/stormcrawler/pull/1943#discussion_r3418782414
##########
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java:
##########
@@ -272,79 +280,91 @@ public void execute(Tuple tuple) {
try {
String html =
Charset.forName(charset).decode(ByteBuffer.wrap(content)).toString();
- jsoupDoc = Parser.htmlParser().parseInput(html, url);
-
- if (!robotsMetaSkip) {
- // extracts the robots directives from the meta tags
- Element robotelement = jsoupDoc.selectFirst("meta[name~=(?i)robots][content]");
- if (robotelement != null) {
- robotsTags.extractMetaTags(robotelement.attr("content"));
- }
- }
-
- // store a normalised representation in metadata
- // so that the indexer is aware of it
- robotsTags.normaliseToMetadata(metadata);
-
- // do not extract the links if no follow has been set
- // and we are in strict mode
- if (robotsTags.isNoFollow() && robotsNoFollowStrict) {
+ if (isPlainText) {
+ // no markup to parse: the decoded content is the text itself and
+ // there are no outlinks. An empty shell document is kept so that
+ // the downstream redirection check and parse filters still work.
+ jsoupDoc = org.jsoup.nodes.Document.createShell(url);
slinks = new HashMap<>(0);
+ robotsTags.normaliseToMetadata(metadata);
+ text = html;
Review Comment:
I'd go with Option A. A plain-text file is already its own text, which is
the whole point of #466, and the test asserts it's stored verbatim, newlines
and all. Option B runs the content through `appendNormalisedText`, which
collapses whitespace, so logs, source files and tabular dumps lose their
layout. It would also force relaxing that test. For the one format where
whitespace is the content, that's a poor trade.
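To make the layout loss concrete, here is a minimal sketch. The `collapse` helper is a hypothetical stand-in that squeezes every whitespace run into one space; it is not the actual `appendNormalisedText` code, only an approximation of the effect described above.

```java
// Hypothetical demo class, not part of StormCrawler.
final class WhitespaceDemo {

    // Assumption: approximates a whitespace-normalising extractor by
    // replacing each run of whitespace (tabs, newlines, spaces) with one space.
    static String collapse(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String table = "id\tname\n1\talice\n2\tbob\n";

        // Option A: the decoded content is kept verbatim, rows and columns intact.
        System.out.print(table);

        // Option B (approximated): the table is flattened onto a single line.
        System.out.println(collapse(table));
    }
}
```

With the tab-separated input above, the collapsed form no longer distinguishes rows from columns, which is the trade-off in question.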
The downside of Option A, re-reading the two config keys, is smaller than
it looks. For plain text only `no.text` and `skip.after` have any effect;
`include.pattern` and `exclude.tags` need markup, so there's nothing else to
honor. If the custom `textextractor.class` case still bothers you, a cleaner
option is a `text(String)` overload on the extractor that applies those two
limits without normalizing, so both code paths share one implementation. I'd
treat that as a follow-up, not part of this PR.
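For that follow-up, a rough sketch of what such an overload could look like. `PlainTextLimits` and its constructor are hypothetical names used for illustration only, and the convention that a negative `skip.after` value means "no limit" is an assumption here, not something confirmed against the current `TextExtractor`.

```java
// Hypothetical helper, not the existing StormCrawler TextExtractor.
final class PlainTextLimits {

    private final boolean noText;  // assumed to mirror the no.text setting
    private final int skipAfter;   // assumed to mirror skip.after; negative = no limit

    PlainTextLimits(boolean noText, int skipAfter) {
        this.noText = noText;
        this.skipAfter = skipAfter;
    }

    // Applies the two limits to raw text without touching its whitespace.
    String text(String raw) {
        if (noText || raw == null) {
            return "";
        }
        if (skipAfter < 0 || raw.length() <= skipAfter) {
            return raw;
        }
        int end = skipAfter;
        // Avoid cutting a surrogate pair in half at the truncation point.
        if (end > 0 && Character.isHighSurrogate(raw.charAt(end - 1))) {
            end--;
        }
        return raw.substring(0, end);
    }
}
```

Under these assumptions, `new PlainTextLimits(false, 7).text("line1\nline2")` keeps the newline and stops after seven characters, which is also the shape a `skip.after` truncation test could take.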
One open question: should plain text use the `TextExtractor` knobs at all,
or is `http.content.limit` the more honest bound for raw bytes? Either way,
Option A plus a test for the `skip.after` truncation works for me.