[
https://issues.apache.org/jira/browse/SOLR-11640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249893#comment-16249893
]
Jason Gerlowski commented on SOLR-11640:
----------------------------------------
It's possible this is a "feature" and not a "bug". If that's the case, maybe
we should clarify the "File endings considered are...." message output by
SimplePostTool, as it implies a whitelist of file extensions.
> QuickStart Tutorial indexes post.jar, other unexpected files
> ------------------------------------------------------------
>
> Key: SOLR-11640
> URL: https://issues.apache.org/jira/browse/SOLR-11640
> Project: Solr
> Issue Type: Bug
> Security Level: Public(Default Security Level. Issues are Public)
> Components: documentation, scripts and tools
> Affects Versions: master (8.0)
> Reporter: Jason Gerlowski
> Priority: Trivial
>
> Currently, the QuickStart tutorial included in the ref guide involves running
> the following command to index some example documents: {{bin/post -c
> techproducts example/exampledocs/*}}
> This ends up attempting to index _all_ the files in that directory, which
> includes the expected example files, but also a bash script called
> {{test_utf8.sh}} and the {{post.jar}} JAR file itself.
> The subsequent tutorial step involves searching results, which can bring up
> the ugly result:
> {code}
> {
>
> "id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
>
> "resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
> "content_type":["application/java-archive"],
> "content":[" \n \n \n \n \n \n \n \n \n \n \n
> META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant
> 1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle Corp
> orati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n \n\n
> \n \n org/apache/solr/util/RTimer$1.class \n package
> org.apache.solr.util;\n synchronized class RTimer$1 {\n}\n \n\n \n \n o
> rg/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n package
> org.apache.solr.util;\n synchronized class RTimer$NanoTimeTimerImpl
> implements RTimer$TimerImpl {\n private long start ;\n private
> void RTimer$NanoTimeTimerImpl();\n public void start ();\n public
> double elapsed ();\n}\n \n\n \n \n
> org/apache/solr/util/RTimer$TimerImpl.class \n package
> org.apache.solr.util;\n public abstra
> ct interface RTimer$TimerImpl {\n public abstract void start ();\n
> public abstract double elapsed ();\n}\n \n\n \n \n
> org/apache/solr/util/RTimer.class \n package org.apache.solr.util;\n p
> ublic synchronized class RTimer {\n public static final int
> STARTED = 0;\n public static final int STOPPED = 1;\n public
> static final int PAUSED = 2;\n protected int s
> ......[remaining code skipped for brevity]........"],
> "_version_":1583971861929132032},
> {code}
> It's honestly pretty cool that Tika can extract code from our post.jar file.
> It makes sense, but I didn't expect it. But it's probably not what we
> intended to show to new users. Especially considering that the bin/post
> invocation in the quick-start tutorial claims to be choosy about what
> filetypes it will index:
> {code}
> Entering auto mode. File endings considered are
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> {code}
> From a quick glance, it looks like {{bin/post}} does pass a list of
> permissible filetypes to the underlying {{SimplePostTool}}, but
> {{SimplePostTool}} doesn't apply this extension whitelist in the particular
> mode invoked by the quickstart tutorial. So this is probably a wider bug
> that the quickstart tutorial just happens to expose.
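For illustration, here is a minimal sketch of the kind of extension check the "File endings considered are" message implies. It is written in Python for brevity and is not SimplePostTool's actual code (which is Java). The names DEFAULT_FILE_TYPES, allowed_extensions and filter_by_extension are hypothetical, and docs.xml stands in for any expected example file. The extension list is copied from the message quoted above.

```python
from pathlib import Path

# Extension list copied from the "File endings considered are" message.
DEFAULT_FILE_TYPES = (
    "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,"
    "odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log"
)


def allowed_extensions(file_types):
    """Parse a comma-separated extension list into a lowercase set."""
    return frozenset(
        ext.strip().lower() for ext in file_types.split(",") if ext.strip()
    )


def filter_by_extension(paths, file_types=DEFAULT_FILE_TYPES):
    """Split paths into (kept, skipped) according to the extension whitelist.

    Hypothetical helper: it only shows what honoring the whitelist would
    look like for files named explicitly on the command line.
    """
    allowed = allowed_extensions(file_types)
    kept, skipped = [], []
    for path in paths:
        # Path.suffix returns e.g. ".jar"; drop the dot and ignore case.
        ext = Path(path).suffix.lstrip(".").lower()
        if ext in allowed:
            kept.append(path)
        else:
            skipped.append(path)
    return kept, skipped


if __name__ == "__main__":
    candidates = [
        "example/exampledocs/docs.xml",  # illustrative file name
        "example/exampledocs/post.jar",
        "example/exampledocs/test_utf8.sh",
    ]
    kept, skipped = filter_by_extension(candidates)
    print("would index:", kept)
    print("would skip: ", skipped)
```

If a check like this were applied to the files produced by the shell glob, post.jar and test_utf8.sh would be skipped instead of being sent for extraction.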