Copilot commented on code in PR #3784:
URL: https://github.com/apache/solr/pull/3784#discussion_r2437788710
##########
solr/solr-ref-guide/modules/indexing-guide/pages/indexing-with-tika.adoc:
##########
@@ -54,27 +54,21 @@ This is provided via the `extraction`
xref:configuration-guide:solr-modules.adoc
The "techproducts" example included with Solr is pre-configured to have Solr
Cell configured.
If you are not using the example, you will want to pay attention to the
section <<solrconfig.xml Configuration>> below.
-== Tika Extraction Backends
+== Tika Server Backend
-There are two backends for this module. The `local` backend embeds Tika inside Solr's own process, while the `tikaserver` backend uses an external Tika server process to do the extraction.
-
-=== Tika Server
-
-The `tikaserver` backend lets Solr delegate content extraction to an external Apache Tika Server process instead of running Tika parsers inside the Solr JVM. This can improve operational isolation (crashes or heavy parsing won’t impact Solr), simplify dependency management, and allow you to scale Tika independently of Solr.
+Solr delegates content extraction to an external Apache Tika Server process. This provides operational isolation (crashes or heavy parsing won't impact Solr), simplifies dependency management, and allows you to scale Tika independently of Solr.
Example handler configuration:
[source,xml]
----
<requestHandler name="/update/extract"
class="solr.extraction.ExtractingRequestHandler">
- <!-- Select the tikaserver backend by default for this handler -->
- <str name="extraction.backend">tikaserver</str>
- <!-- Point Solr to your Tika Server -->
+ <!-- Point Solr to your Tika Server (required) -->
<str name="tikaserver.url">http://localhost:9998</str>
</requestHandler>
----
-==== Starting Tika Server with Docker
+=== Starting Tika Server with Docker
The quickest way to run Tika Server for development is using Docker. The
examples below expose Tika on port 9998 on localhost, matching the default
value when `tikaserver.url` is not explicitly set.
Review Comment:
`tikaserver.url` is now required by the handler and no default is assumed.
Please remove the sentence about a default value and state that the example
uses port 9998 for convenience.
```suggestion
The quickest way to run Tika Server for development is using Docker. The
examples below expose Tika on port 9998 on localhost for convenience, matching
the handler configuration above.
```
##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/ExtractingRequestHandler.java:
##########
@@ -60,84 +54,61 @@ public PermissionNameProvider.Name
getPermissionName(AuthorizationContext reques
@Override
public void inform(SolrCore core) {
try {
- // Store tika config location (backend-specific)
- this.tikaConfigLoc = (String) initArgs.get(CONFIG_LOCATION);
-
- String parseContextConfigLoc = (String) initArgs.get(PARSE_CONTEXT_CONFIG);
- if (parseContextConfigLoc == null) { // default:
- parseContextConfig = new ParseContextConfig();
- } else {
- parseContextConfig =
- new ParseContextConfig(core.getResourceLoader(), parseContextConfigLoc);
+ // Create Tika Server backend - now the only supported backend
+ String tikaServerUrl = (String) initArgs.get(ExtractingParams.TIKASERVER_URL);
+ if (tikaServerUrl == null || tikaServerUrl.trim().isEmpty()) {
+ throw new SolrException(
+ ErrorCode.SERVER_ERROR,
+ "Tika Server URL must be configured via '"
+ + ExtractingParams.TIKASERVER_URL
+ + "' parameter");
}
- // Always create local backend
- this.localBackend = new LocalTikaExtractionBackend(core, tikaConfigLoc, parseContextConfig);
-
- // Optionally create Tika Server backend if URL configured
- String tikaServerUrl = (String) initArgs.get(ExtractingParams.TIKASERVER_URL);
- if (tikaServerUrl != null && !tikaServerUrl.trim().isEmpty()) {
- int timeoutSecs = 0;
- Object initTimeout = initArgs.get(ExtractingParams.TIKASERVER_TIMEOUT_SECS);
- if (initTimeout != null) {
- try {
- timeoutSecs = Integer.parseInt(String.valueOf(initTimeout));
- } catch (NumberFormatException nfe) {
- throw new SolrException(
- ErrorCode.SERVER_ERROR,
- "Invalid value for '"
- + ExtractingParams.TIKASERVER_TIMEOUT_SECS
- + "': "
- + initTimeout,
- nfe);
- }
+ int timeoutSecs = 0;
+ Object initTimeout = initArgs.get(ExtractingParams.TIKASERVER_TIMEOUT_SECS);
+ if (initTimeout != null) {
+ try {
+ timeoutSecs = Integer.parseInt(String.valueOf(initTimeout));
+ } catch (NumberFormatException nfe) {
+ throw new SolrException(
+ ErrorCode.SERVER_ERROR,
+ "Invalid value for '"
+ + ExtractingParams.TIKASERVER_TIMEOUT_SECS
+ + "': "
+ + initTimeout,
+ nfe);
}
- Object maxCharsObj = initArgs.get(ExtractingParams.TIKASERVER_MAX_CHARS);
- long maxCharsLimit = TikaServerExtractionBackend.DEFAULT_MAXCHARS_LIMIT;
- if (maxCharsObj != null) {
- try {
- maxCharsLimit = Long.parseLong(String.valueOf(maxCharsObj));
- } catch (NumberFormatException nfe) {
- throw new SolrException(
- ErrorCode.SERVER_ERROR,
- "Invalid value for '"
- + ExtractingParams.TIKASERVER_MAX_CHARS
- + "': "
- + maxCharsObj);
- }
+ }
+ Object maxCharsObj = initArgs.get(ExtractingParams.TIKASERVER_MAX_CHARS);
+ long maxCharsLimit = TikaServerExtractionBackend.DEFAULT_MAXCHARS_LIMIT;
+ if (maxCharsObj != null) {
+ try {
+ maxCharsLimit = Long.parseLong(String.valueOf(maxCharsObj));
+ } catch (NumberFormatException nfe) {
+ throw new SolrException(
+ ErrorCode.SERVER_ERROR,
+ "Invalid value for '" + ExtractingParams.TIKASERVER_MAX_CHARS + "': " + maxCharsObj);
}
- this.tikaServerBackend =
- new TikaServerExtractionBackend(tikaServerUrl, timeoutSecs, initArgs, maxCharsLimit);
}
+ this.tikaServerBackend =
+ new TikaServerExtractionBackend(tikaServerUrl, timeoutSecs, initArgs, maxCharsLimit);
// Choose default backend name
String backendName = (String) initArgs.get(ExtractingParams.EXTRACTION_BACKEND);
this.defaultBackendName =
(backendName == null || backendName.trim().isEmpty())
- ? LocalTikaExtractionBackend.NAME
+ ? TikaServerExtractionBackend.NAME
: backendName;
- // Validate backend and check configuration
- switch (this.defaultBackendName) {
- case LocalTikaExtractionBackend.NAME:
- break;
- case TikaServerExtractionBackend.NAME:
- // Tika Server backend requires URL to be configured
- if (this.tikaServerBackend == null) {
- throw new SolrException(
- ErrorCode.INVALID_STATE, "Tika Server backend requested but no URL configured");
- }
- break;
- default:
- throw new SolrException(
- ErrorCode.BAD_REQUEST,
- "Invalid extraction backend: '"
- + this.defaultBackendName
- + "'. Must be one of: '"
- + LocalTikaExtractionBackend.NAME
- + "', '"
- + TikaServerExtractionBackend.NAME
- + "'");
+ // Validate backend name
+ if (!TikaServerExtractionBackend.NAME.equals(this.defaultBackendName)) {
+ throw new SolrException(
+ ErrorCode.BAD_REQUEST,
+ "Invalid extraction backend: '"
+ + this.defaultBackendName
+ + "'. Only '"
+ + TikaServerExtractionBackend.NAME
+ + "' is supported");
Review Comment:
[nitpick] This exception is thrown during core initialization because of a
configuration problem, not a client request. Consider using
`ErrorCode.SERVER_ERROR` (or `INVALID_STATE`) instead of `BAD_REQUEST` so the
error reflects a server-side misconfiguration.
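A minimal sketch of that change, using self-contained stand-in types rather
than Solr's real classes. `BackendNameCheck`, `ConfigException`, and the local
`ErrorCode` enum are hypothetical names chosen for illustration. They only
mimic the shape of `SolrException` and are not part of this PR or of Solr.

```java
// Illustrative stand-in for the handler's backend-name validation.
// None of these types are Solr classes; they are assumptions for this sketch.
class BackendNameCheck {

  // Hypothetical error codes mirroring the two options discussed above.
  enum ErrorCode {
    BAD_REQUEST,
    SERVER_ERROR
  }

  // Hypothetical exception that carries an error code, like SolrException does.
  static class ConfigException extends RuntimeException {
    final ErrorCode code;

    ConfigException(ErrorCode code, String message) {
      super(message);
      this.code = code;
    }
  }

  static final String TIKASERVER = "tikaserver";

  // Falls back to the only supported backend when nothing is configured, and
  // reports any other name as a server-side configuration error.
  static String resolveBackend(String configured) {
    String name =
        (configured == null || configured.trim().isEmpty()) ? TIKASERVER : configured;
    if (!TIKASERVER.equals(name)) {
      throw new ConfigException(
          ErrorCode.SERVER_ERROR,
          "Invalid extraction backend: '" + name + "'. Only '" + TIKASERVER + "' is supported");
    }
    return name;
  }

  public static void main(String[] args) {
    System.out.println(resolveBackend(null));
    try {
      resolveBackend("local");
    } catch (ConfigException e) {
      System.out.println(e.code + ": " + e.getMessage());
    }
  }
}
```

In the actual handler the equivalent change would presumably be limited to the
error code passed to `SolrException`. The message text can stay as it is.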
##########
solr/modules/extraction/src/java/org/apache/solr/handler/extraction/ExtractingRequestHandler.java:
##########
@@ -35,22 +35,16 @@
* Handler for rich documents like PDF or Word or any other file format that
Tika handles that need
* the text to be extracted first from the document.
*/
-@SuppressWarnings("removal")
public class ExtractingRequestHandler extends ContentStreamHandlerBase
implements SolrCoreAware, PermissionNameProvider {
private static final Logger log =
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
public static final String PARSE_CONTEXT_CONFIG = "parseContext.config";
Review Comment:
The PARSE_CONTEXT_CONFIG constant is now unused after removing
local/parse-context support. Consider removing it to avoid confusion and keep
the public API surface minimal.
```suggestion
```
##########
solr/solr-ref-guide/modules/upgrade-notes/pages/major-changes-in-solr-10.adoc:
##########
@@ -172,6 +172,8 @@ Nowadays, the HTTP request is available via internal APIs:
`SolrQueryRequest.get
* The deprecated transient Solr cores capability has been removed.
(SOLR-17932)
+* LocalTikaExtractionBackend, which was deprecated in 9.10, has been removed. The 'tikaserver' extraction backend is now the only supported backend for the ExtractingRequestHandler, and the default. Users must configure a Tika Server URL via the `tikaserver.url` parameter. (SOLR-17961)
+
Review Comment:
Consider expanding this note to mention that parse-context-based
configuration (parseContext.config) is no longer supported and that Tika
parser-specific properties must be configured directly on the Tika Server.
```suggestion
+
+NOTE: The previous parse-context-based configuration
(`parseContext.config`) is no longer supported. Tika parser-specific properties
must now be configured directly on the Tika Server itself, rather than through
Solr configuration. Please refer to the Tika Server documentation for details
on how to set these properties.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]