[ 
https://issues.apache.org/jira/browse/SOLR-12798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16631149#comment-16631149
 ] 

Karl Wright commented on SOLR-12798:
------------------------------------

[~janhoy]:

{quote}
That would be for case 1) where  you don't do Tika stuff on the MCF side but 
want Solr to handle the binary stream. In this case there should be no problem 
with huge metadata request params. And I agree that SolrJ should support this 
case (ContentStreamUpdateRequest?).
{quote}

OK.  At the moment that sort of request appears to be sent as a standard 
POST with the metadata stuffed into the URL, so a fix is needed for that.

{quote}
I got confused by your other use case where you parse the file with Tika on the 
MCF side and still sent the text to /extract
{quote}

While Julien has a custom Solr handler, that's not what we typically do; we 
recommend that already-Tika-extracted content and metadata be sent to the 
/update handler.  In that case, we build a SolrInputDocument from the content 
stream and add it to an UpdateRequest.  This mode also seems to use a standard 
POST (or even PUT) to the /update handler, with all the metadata parameters on 
the URL.  Do you want to support the case where the metadata parameters are 
large enough that the URL exceeds 8192 bytes?
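
To make the size question concrete, here is a minimal, self-contained sketch 
(plain Java 11+, no SolrJ) of checking whether URL-encoded metadata parameters 
would push a request URL past 8192 bytes.  The class name, the base URL, and 
the literal.* parameter names are illustrative assumptions, not what 
ManifoldCF or SolrJ actually does, and the real limit depends on how the 
server is configured.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class UrlLengthCheck {
    // Limit taken from the discussion above; real servers may be configured differently.
    static final int MAX_URL_BYTES = 8192;

    // URL-encode each key/value pair and join them into a query string.
    static String encodeParams(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joiner.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joiner.toString();
    }

    // True if the base URL plus the encoded parameters stays within the limit.
    static boolean fitsInUrl(String baseUrl, Map<String, String> params) {
        String url = baseUrl + "?" + encodeParams(params);
        return url.getBytes(StandardCharsets.US_ASCII).length <= MAX_URL_BYTES;
    }

    public static void main(String[] args) {
        // Hypothetical endpoint and metadata, for illustration only.
        String baseUrl = "http://localhost:8983/solr/collection1/update";
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("literal.title", "Quarterly report");
        System.out.println("small metadata fits: " + fitsInUrl(baseUrl, metadata));

        // A single large metadata value is enough to exceed the limit.
        metadata.put("literal.acl", "x".repeat(9000));
        System.out.println("large metadata fits: " + fitsInUrl(baseUrl, metadata));
    }
}
```

When the check fails, the parameters would have to travel in the request body 
(for example as multipart or form-encoded content) instead of on the URL, 
which is the case this ticket is asking about.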






> Structural changes in SolrJ since version 7.0.0 have effectively disabled 
> multipart post
> ----------------------------------------------------------------------------------------
>
>                 Key: SOLR-12798
>                 URL: https://issues.apache.org/jira/browse/SOLR-12798
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrJ
>    Affects Versions: 7.4
>            Reporter: Karl Wright
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: HOT Balloon Trip_Ultra HD.jpg, 
> SOLR-12798-approach.patch, SOLR-12798-reproducer.patch, no params in url.png, 
> solr-update-request.txt
>
>
> Project ManifoldCF uses SolrJ to post documents to Solr.  When upgrading from 
> SolrJ 7.0.x to SolrJ 7.4, we encountered significant structural changes to 
> SolrJ's HttpSolrClient class that seemingly disable any use of multipart 
> post.  This is critical because ManifoldCF's documents often contain metadata 
> in excess of 4K that therefore cannot be stuffed into a URL.
> The changes in question seem to have been performed by Noble Paul on 
> 10/31/2017, with the introduction of the RequestWriter mechanism.  Basically, 
> if a request has a RequestWriter, it is used exclusively to write the 
> request, and that overrides the stream mechanism completely.  I haven't 
> chased it back to a specific ticket.
> ManifoldCF's usage of SolrJ involves the creation of 
> ContentStreamUpdateRequests for all posts meant for Solr Cell, and the 
> creation of UpdateRequests for posts not meant for Solr Cell (as well as for 
> delete and commit requests).  For our release cycle that is taking place 
> right now, we're shipping a modified version of HttpSolrClient that ignores 
> the RequestWriter when dealing with ContentStreamUpdateRequests.  We 
> apparently cannot use multipart for all requests because, when we do, we 
> get "pfountz Should not get here!" errors on the Solr side, which generate 
> HTTP error code 500 responses.  That should not happen either, in my 
> opinion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
