Jason Gerlowski created SOLR-11640:
--------------------------------------
Summary: QuickStart Tutorial indexes post.jar, other unexpected
files
Key: SOLR-11640
URL: https://issues.apache.org/jira/browse/SOLR-11640
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: documentation, scripts and tools
Affects Versions: master (8.0)
Reporter: Jason Gerlowski
Priority: Trivial
Currently, the QuickStart tutorial included in the ref guide involves running
the following command to index some example documents:
{{bin/post -c techproducts example/exampledocs/*}}
This ends up attempting to index _all_ the files in that directory, which
includes the expected example files, but also a bash script called
{{test_utf8.sh}} and the {{post.jar}} JAR file itself.
The subsequent tutorial step involves searching results, which can bring up the
ugly result:
{code}
{
"id":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
"resourcename":"/home/jason/checkouts/lucene-solr/solr/example/exampledocs/post.jar",
"content_type":["application/java-archive"],
"content":[" \n \n \n \n \n \n \n \n \n \n \n
META-INF/MANIFEST.MF \n Manifest-Version: 1.0\r\nAnt-Version: Apache Ant
1.9.6\r\nCreated-By: 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 (Oracle Corp
orati\r\n on)\r\nMain-Class: org.apache.solr.util.SimplePostTool\r\n\r\n \n\n
\n \n org/apache/solr/util/RTimer$1.class \n package
org.apache.solr.util;\n synchronized class RTimer$1 {\n}\n \n\n \n \n o
rg/apache/solr/util/RTimer$NanoTimeTimerImpl.class \n package
org.apache.solr.util;\n synchronized class RTimer$NanoTimeTimerImpl
implements RTimer$TimerImpl {\n private long start ;\n private
void RTimer$NanoTimeTimerImpl();\n public void start ();\n public
double elapsed ();\n}\n \n\n \n \n
org/apache/solr/util/RTimer$TimerImpl.class \n package
org.apache.solr.util;\n public abstra
ct interface RTimer$TimerImpl {\n public abstract void start ();\n
public abstract double elapsed ();\n}\n \n\n \n \n
org/apache/solr/util/RTimer.class \n package org.apache.solr.util;\n p
ublic synchronized class RTimer {\n public static final int
STARTED = 0;\n public static final int STOPPED = 1;\n public
static final int PAUSED = 2;\n protected int s
......[remaining code skipped for brevity]........"],
"_version_":1583971861929132032},
{code}
It's honestly pretty cool that Tika can extract code from our post.jar file.
It makes sense, but I didn't expect it. But it's probably not what we intended
to show to new users. Especially considering that the bin/post invocation in
the quick-start tutorial claims to be choosy about what filetypes it will index:
{code}
Entering auto mode. File endings considered are
xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
{code}
From a quick glance at things, it looks like {{bin/post}} does pass a list of
permissible filetypes to the underlying {{SimplePostTool}}, but that
{{SimplePostTool}} doesn't follow this extension whitelist in the particular mode
being invoked by the quickstart tutorial. So this is probably a wider bug,
that the quickstart/tutorial just happens to expose.
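To make the expected behavior concrete, here is a minimal sketch of the kind of extension filter the "File endings considered" message implies. It is written in Python purely for illustration and is not SimplePostTool's actual code; the function name {{split_by_extension}} and the file {{sample.xml}} are hypothetical, while the extension list and the other two file names are taken from the report above.

```python
from pathlib import PurePosixPath

# Extension list copied from the bin/post "auto mode" message quoted above.
ALLOWED_EXTENSIONS = frozenset(
    "xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,"
    "ott,otp,ots,rtf,htm,html,txt,log".split(",")
)


def split_by_extension(paths, allowed=ALLOWED_EXTENSIONS):
    """Split paths into (accepted, skipped) by file extension.

    Hypothetical helper: the comparison is case-insensitive, and files
    without any extension are skipped. Both choices are assumptions.
    """
    accepted, skipped = [], []
    for path in paths:
        ext = PurePosixPath(path).suffix.lstrip(".").lower()
        (accepted if ext in allowed else skipped).append(path)
    return accepted, skipped


candidates = [
    "example/exampledocs/post.jar",
    "example/exampledocs/test_utf8.sh",
    "example/exampledocs/sample.xml",  # hypothetical file name
]
accepted, skipped = split_by_extension(candidates)
print("would index:", accepted)
print("would skip:", skipped)
```

With a filter like this applied to explicitly listed files as well as to directory contents, {{post.jar}} and {{test_utf8.sh}} would be skipped instead of being handed to the extraction handler.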
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)