Hi Karl,

I tracked the problem down to this:

A metadata field is causing this. If I select only the ID field (normally I
select all of these: Created, FileLeafRef, ID, IKAccessGroup, IKContentType,
IKDocuments, IKExpertise, IKExplanation, IKFAQ, IKImportant, Modified, Title),
all aspx files are indexed successfully.

So the contentStreamUpdateRequest.addContentStream(new
RepositoryDocumentStream(is, length)); part is not the problem.
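
For debugging, a wrapper along the lines Karl suggested might look like this (a rough sketch; the class name and the main() driver are made up for illustration, not connector code):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Debugging wrapper (hypothetical name): counts what Solr actually reads. */
class CountingDebugStream extends FilterInputStream {
    private long bytesRead = 0;

    CountingDebugStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) bytesRead++;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) bytesRead += n;
        return n;
    }

    public long getBytesRead() {
        return bytesRead;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the repository document stream; HttpClient would drain it.
        byte[] data = "hello solr".getBytes("UTF-8");
        CountingDebugStream s = new CountingDebugStream(new ByteArrayInputStream(data));
        byte[] buf = new byte[4];
        while (s.read(buf, 0, buf.length) != -1) { /* drain */ }
        System.out.println("streamed " + s.getBytesRead() + " of " + data.length + " bytes");
    }
}
```

Comparing getBytesRead() against the declared length after the post would show whether the content length is wrong, as Karl suspected.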

I suspect one of the metadata fields is very long (some metadata fields have
HTML tags in them). Is there a limit on the length of modifiable Solr params?
e.g. &literal.IKFAQ="very long text that contains some html tags"
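
If so, I suspect the limit is Jetty's request header buffer rather than Solr itself: the literal.* values travel in the request URL, so a long field value can push the request line past the 6144-byte buffer that shows up as g=6144 in the HttpParser:Full warning, and Jetty answers 413. A back-of-the-envelope sketch (my own illustration; the /solr/all/update/extract path and the class name are assumptions, not connector code):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/**
 * Back-of-the-envelope sketch (not connector code): literal.* values travel
 * in the URL query string, so a long field value inflates the HTTP request
 * line past Jetty's default header buffer (the g=6144 in the warning).
 */
public class LiteralParamLength {
    // The /solr/all/update/extract path is illustrative.
    public static int requestLineLength(String fieldValue) throws UnsupportedEncodingException {
        String query = "literal.IKFAQ=" + URLEncoder.encode(fieldValue, "UTF-8");
        String requestLine = "POST /solr/all/update/extract?" + query + " HTTP/1.1";
        return requestLine.getBytes("UTF-8").length;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Simulate a metadata field full of HTML, as in our IKFAQ field.
        StringBuilder longHtml = new StringBuilder();
        while (longHtml.length() < 8000) longHtml.append("<p>some html</p>");
        System.out.println("short value fits in 6144: " + (requestLineLength("faq text") < 6144));
        System.out.println("long value fits in 6144:  " + (requestLineLength(longHtml.toString()) < 6144));
    }
}
```

That would also explain why raising the POST size limit earlier made no difference: the body limit is not the buffer being overflowed.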

I will investigate further.
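
If that is confirmed, raising Jetty's header buffer might be a workaround until the connector can skip such documents. I believe the setting in the jetty.xml shipped with recent Solr (Jetty 8) is requestHeaderSize on the connector (older Jetty versions call it headerBufferSize), something like:

```
<!-- etc/jetty.xml, inside the connector definition; setting name is
     requestHeaderSize on Jetty 8 (headerBufferSize on older Jetty) -->
<Set name="requestHeaderSize">65536</Set>
```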

Thanks,
Ahmet

--- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:

> From: Karl Wright <[email protected]>
> Subject: Re: Repeated service interruptions - failure processing document: 
> null
> To: [email protected]
> Date: Monday, January 14, 2013, 10:34 PM
> Let's try to figure out why we can't index streamed data from these
> .aspx files.  Can you add enough debugging output to figure out what
> the connector is actually trying to stream to Solr?  In order to do
> that you may well need to write a class that wraps the input stream
> that is handed to Solr with one that outputs enough information for us
> to make sense of this.
> 
> What might be happening is that the content length is missing or
> wrong, and as a result the transfer just keeps going or something.
> 
> Karl
> 
> On Mon, Jan 14, 2013 at 3:23 PM, Ahmet Arslan <[email protected]>
> wrote:
> > Hi Karl,
> >
> > I think people may want to index the content of aspx files, so
> > treating them specially may not be a good solution.
> >
> > In our environment, aspx files are used to construct a web site that
> > is used internally. In my understanding this is one of the use cases
> > of SharePoint. In our case the content of the aspx files is fetched
> > from a List. We can access the content of the aspx files from the
> > List. They don't have html tags etc. in them.
> >
> > But I am not sure if this is common usage of aspx and Lists.
> >
> > I was thinking of some option like "index only metadata" that simply
> > ignores the document itself.
> >
> > By the way, I checked some of the skipped aspx files; their sizes are
> > not too big: 101 KB, 139 KB, etc.
> >
> > I suspect some other factor is triggering this. Also I am seeing this
> > weird warning on the jetty that runs solr:
> >
> > WARN:oejh.HttpParser:Full [1771440721,-1,m=5,g=6144,p=6144,c=6144]={2F73
> >
> > Thanks,
> > Ahmet
> >
> > --- On Mon, 1/14/13, Karl Wright <[email protected]>
> wrote:
> >
> >> From: Karl Wright <[email protected]>
> >> Subject: Re: Repeated service interruptions - failure processing document: null
> >> To: [email protected]
> >> Date: Monday, January 14, 2013, 6:46 PM
> >> Hi Ahmet,
> >>
> >> We could treat .aspx files specially, so that they are considered to
> >> never have any content.  But are there cases where someone might want
> >> to index any content that these URLs might return?  Specifically,
> >> what do .aspx "files" typically contain, when found in a SharePoint
> >> hierarchy?
> >>
> >> Karl
> >>
> >> On Mon, Jan 14, 2013 at 11:37 AM, Ahmet Arslan
> <[email protected]>
> >> wrote:
> >> > Hi Karl,
> >> >
> >> > Now 39 aspx files (out of 130) are indexed. The job didn't get
> >> > killed. No exceptions in the log.
> >> >
> >> > I increased the maximum POST size of solr/jetty, but that number
> >> > (39) didn't increase.
> >> >
> >> > I will check the size of the remaining 130 - 39 *.aspx files.
> >> >
> >> > Actually, I am mapping the extracted content of these aspx files to
> >> > an ignored dynamic field (fmap.content=content_ignored). I don't
> >> > use them; I am only interested in the metadata of these aspx files.
> >> > It would be great if there were a setting to just grab metadata,
> >> > similar to Lists.
> >> >
> >> > Thanks,
> >> > Ahmet
> >> >
> >> > --- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:
> >> >
> >> >> From: Karl Wright <[email protected]>
> >> >> Subject: Re: Repeated service interruptions - failure processing document: null
> >> >> To: [email protected]
> >> >> Date: Monday, January 14, 2013, 5:46 PM
> >> >> I checked in a fix for this ticket on trunk.  Please let me know
> >> >> if it resolves this issue.
> >> >>
> >> >> Karl
> >> >>
> >> >> On Mon, Jan 14, 2013 at 10:20 AM, Karl Wright <[email protected]> wrote:
> >> >> > This is because httpclient is retrying on error three times by
> >> >> > default.  This has to be disabled in the Solr connector, or the
> >> >> > rest of the logic won't work right.
> >> >> >
> >> >> > I've opened a ticket (CONNECTORS-610) for this problem too.
> >> >> >
> >> >> > Karl
> >> >> >
> >> >> > On Mon, Jan 14, 2013 at 10:13 AM, Ahmet Arslan <[email protected]> wrote:
> >> >> >> Hi Karl,
> >> >> >>
> >> >> >> Thanks for the quick fix.
> >> >> >>
> >> >> >> I am still seeing the following error after 'svn up' and 'ant build':
> >> >> >>
> >> >> >> ERROR 2013-01-14 17:09:41,949 (Worker thread '6') - Exception tossed: Repeated service interruptions - failure processing document: null
> >> >> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: null
> >> >> >>     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
> >> >> >> Caused by: org.apache.http.client.ClientProtocolException
> >> >> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
> >> >> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> >> >> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> >> >> >>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
> >> >> >>     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >> >> >>     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> >> >>     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:790)
> >> >> >> Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
> >> >> >>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:692)
> >> >> >>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:523)
> >> >> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> >> >> >>     ... 6 more
> >> >> >> Caused by: java.net.SocketException: Broken pipe
> >> >> >>     at java.net.SocketOutputStream.socketWrite0(Native Method)
> >> >> >>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> >> >> >>     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >> >> >>     at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:169)
> >> >> >>     at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110)
> >> >> >>     at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165)
> >> >> >>     at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92)
> >> >> >>     at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
> >> >> >>     at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
> >> >> >>     at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
> >> >> >>     at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
> >> >> >>     at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
> >> >> >>     at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
> >> >> >>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
> >> >> >>     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:718)
> >> >> >>     ... 8 more
> >> >> >>
> >> >> >> --- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:
> >> >> >>
> >> >> >>> From: Karl Wright <[email protected]>
> >> >> >>> Subject: Re: Repeated service interruptions - failure processing document: null
> >> >> >>> To: [email protected]
> >> >> >>> Date: Monday, January 14, 2013, 3:30 PM
> >> >> >>> Hi Ahmet,
> >> >> >>>
> >> >> >>> The exception that seems to be causing the abort is a socket
> >> >> >>> exception coming from a socket write:
> >> >> >>>
> >> >> >>> > Caused by: java.net.SocketException: Broken pipe
> >> >> >>>
> >> >> >>> This makes sense in light of the http code returned from Solr,
> >> >> >>> which was 413:  http://www.checkupdown.com/status/E413.html .
> >> >> >>>
> >> >> >>> So there is nothing actually *wrong* with the .aspx documents,
> >> >> >>> but they are just way too big, and Solr is rejecting them for
> >> >> >>> that reason.
> >> >> >>>
> >> >> >>> Clearly, though, the Solr connector should recognize this code
> >> >> >>> as meaning "never retry", so instead of killing the job, it
> >> >> >>> should just skip the document.  I'll open a ticket for that now.
> >> >> >>>
> >> >> >>> Karl
> >> >> >>>
> >> >> >>> On Mon, Jan 14, 2013 at 8:22 AM, Ahmet Arslan <[email protected]> wrote:
> >> >> >>> > Hello,
> >> >> >>> >
> >> >> >>> > I am indexing a SharePoint 2010 instance using mcf-trunk (at revision 1432907).
> >> >> >>> >
> >> >> >>> > There is no problem with a Document library that contains word, excel, etc.
> >> >> >>> >
> >> >> >>> > However, I receive the following errors with a Document library that has *.aspx files in it.
> >> >> >>> >
> >> >> >>> > Status of Jobs => Error: Repeated service interruptions - failure processing document: null
> >> >> >>> >
> >> >> >>> > WARN 2013-01-14 15:00:12,720 (Worker thread '13') - Service interruption reported for job 1358009105156 connection 'iknow': IO exception during indexing: null
> >> >> >>> > ERROR 2013-01-14 15:00:12,763 (Worker thread '13') - Exception tossed: Repeated service interruptions - failure processing document: null
> >> >> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: null
> >> >> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
> >> >> >>> > Caused by: org.apache.http.client.ClientProtocolException
> >> >> >>> >     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
> >> >> >>> >     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> >> >> >>> >     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> >> >> >>> >     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
> >> >> >>> >     at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >> >> >>> >     at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> >> >>> >     at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:768)
> >> >> >>> > Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
> >> >> >>> >     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:692)
> >> >> >>> >     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:523)
> >> >> >>> >     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> >> >> >>> >     ... 6 more
> >> >> >>> > Caused by: java.net.SocketException: Broken pipe
> >> >> >>> >     at java.net.SocketOutputStream.socketWrite0(Native Method)
> >> >> >>> >     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> >> >> >>> >     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >> >> >>> >     at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:169)
> >> >> >>> >     at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110)
> >> >> >>> >     at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165)
> >> >> >>> >     at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92)
> >> >> >>> >     at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
> >> >> >>> >     at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
> >> >> >>> >     at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
> >> >> >>> >     at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
> >> >> >>> >     at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
> >> >> >>> >     at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
> >> >> >>> >     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
> >> >> >>> >     at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:718)
> >> >> >>> >     ... 8 more
> >> >> >>> >
> >> >> >>> > Status of Jobs => Error: Unhandled Solr exception during indexing (0): Server at http://localhost:8983/solr/all returned non ok status:413, message:FULL head
> >> >> >>> >
> >> >> >>> > ERROR 2013-01-14 15:10:42,074 (Worker thread '15') - Exception tossed: Unhandled Solr exception during indexing (0): Server at http://localhost:8983/solr/all returned non ok status:413, message:FULL head
> >> >> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unhandled Solr exception during indexing (0): Server at http://localhost:8983/solr/all returned non ok status:413, message:FULL head
> >> >> >>> >     at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrException(HttpPoster.java:360)
> >> >> >>> >     at org.apache.manifoldcf.agents.output.solr.HttpPoster.indexPost(HttpPoster.java:477)
> >> >> >>> >     at org.apache.manifoldcf.agents.output.solr.SolrConnector.addOrReplaceDocument(SolrConnector.java:594)
> >> >> >>> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
> >> >> >>> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
> >> >> >>> >     at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
> >> >> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
> >> >> >>> >     at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1559)
> >> >> >>> >     at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >> >> >>> >     at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
> >> >> >>> >
> >> >> >>> > On the solr side I see:
> >> >> >>> >
> >> >> >>> > INFO: Creating new http client, config:maxConnections=200&maxConnectionsPerHost=8
> >> >> >>> > 2013-01-14 15:18:21.775:WARN:oejh.HttpParser:Full [671412972,-1,m=5,g=6144,p=6144,c=6144]={2F736F6C722F616 ...long long chars... 2B656B6970{}
> >> >> >>> >
> >> >> >>> > Thanks,
> >> >> >>> > Ahmet
