Hi Karl,

I tracked the problem down to this:
One of the metadata fields is causing this. If I select only the ID metadata field (normally I select all of these: Created, FileLeafRef, ID, IKAccessGroup, IKContentType, IKDocuments, IKExpertise, IKExplanation, IKFAQ, IKImportant, Modified, Title), all aspx files are indexed successfully. So the contentStreamUpdateRequest.addContentStream(new RepositoryDocumentStream(is,length)); part is not the problem. I suspect one of the metadata fields is very long (some metadata fields have HTML tags in them). Is there a limit on the modifiable Solr params? &literal.IKFAQ="very long text that contains some html tags"

I will investigate further.

Thanks,
Ahmet

--- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:

> From: Karl Wright <[email protected]>
> Subject: Re: Repeated service interruptions - failure processing document: null
> To: [email protected]
> Date: Monday, January 14, 2013, 10:34 PM
>
> Let's try to figure out why we can't index streamed data from these
> .aspx files. Can you add enough debugging output to figure out what
> the connector is actually trying to stream to Solr? In order to do
> that you may well need to write a class that wraps the input stream
> that is handed to Solr with one that outputs enough information for us
> to make sense of this.
>
> What might be happening might be that the content length is missing or
> wrong, and as a result the transfer just keeps going or something.
>
> Karl
>
> On Mon, Jan 14, 2013 at 3:23 PM, Ahmet Arslan <[email protected]> wrote:
> > Hi Karl,
> >
> > I think people may want to index the content of aspx files, so treating
> > them specially may not be a good solution.
> >
> > In our environment, aspx files are used to construct a web site that is
> > used internally. In my understanding this is one of the use cases of
> > SharePoint. In our case the content of aspx files is fetched from a List.
> > We can access the content of aspx files from the List. They don't have
> > html tags etc. in them.
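Karl's suggestion above of wrapping the stream handed to Solr could look something like the following minimal sketch. This is not the connector's actual debugging code; the class name is hypothetical, and it only counts and reports bytes so the total can be compared against the content length the connector declared.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Debugging wrapper for the stream handed to Solr: counts the bytes
 * actually read and reports the total when the stream is closed, so the
 * count can be compared with the length passed to addContentStream().
 */
class LoggingInputStream extends FilterInputStream {
  private long bytesRead = 0;

  LoggingInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = in.read();
    if (b != -1) bytesRead++;
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = in.read(buf, off, len);
    if (n > 0) bytesRead += n;
    return n;
  }

  public long getBytesRead() {
    return bytesRead;
  }

  @Override
  public void close() throws IOException {
    // Report how many bytes Solr actually consumed before closing.
    System.err.println("Stream closed after " + bytesRead + " bytes");
    super.close();
  }
}
```

One could then wrap the RepositoryDocumentStream before handing it to the update request, and compare the reported byte count against the declared length; a mismatch would point at a content-length problem rather than the metadata.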
> > But I am not sure if this is the common usage of aspx and Lists.
> >
> > I was thinking of some option like "index only metadata" that simply
> > ignores the document itself.
> >
> > By the way, I checked some of the skipped aspx files; their sizes are not
> > too big: 101 KB, 139 KB, etc.
> >
> > I suspect some other factor is triggering this. Also I am seeing this
> > weird warning on the jetty that runs solr:
> >
> > WARN:oejh.HttpParser:Full [1771440721,-1,m=5,g=6144,p=6144,c=6144]={2F73
> >
> > Thanks,
> > Ahmet
> >
> > --- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:
> >
> >> From: Karl Wright <[email protected]>
> >> Subject: Re: Repeated service interruptions - failure processing document: null
> >> To: [email protected]
> >> Date: Monday, January 14, 2013, 6:46 PM
> >>
> >> Hi Ahmet,
> >>
> >> We could specifically treat .aspx files specially, so that they are
> >> considered to never have any content. But are there cases where
> >> someone might want to index any content that these URLs might return?
> >> Specifically, what do .aspx "files" typically contain, when found in a
> >> SharePoint hierarchy?
> >>
> >> Karl
> >>
> >> On Mon, Jan 14, 2013 at 11:37 AM, Ahmet Arslan <[email protected]> wrote:
> >> > Hi Karl,
> >> >
> >> > Now 39 aspx files (out of 130) are indexed. The job didn't get killed.
> >> > No exceptions in the log.
> >> >
> >> > I increased the maximum POST size of solr/jetty but that number of 39
> >> > didn't increase.
> >> >
> >> > I will check the size of the remaining 130 - 39 *.aspx files.
> >> >
> >> > Actually I am mapping the extracted content of these aspx files to an
> >> > ignored dynamic field (fmap.content=content_ignored); I don't use them.
> >> > I am only interested in the metadata of these aspx files. It would be
> >> > great if there were a setting to just grab metadata, similar to Lists.
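A note on the jetty "HttpParser:Full" warning above: the hex dump it prints (2F73... decodes to "/sol...") suggests the HTTP request line itself overflowed jetty's header buffer. Since the literal.* metadata values are sent as URL parameters alongside the content stream, a very long metadata field would blow past that buffer. A hedged sketch of the relevant connector setting for an eclipse-jetty (7/8 era) jetty.xml follows; the exact connector class, property name, and default size depend on the jetty version shipped with your Solr, so treat this as an assumption to verify against your own jetty.xml:

```xml
<!-- Sketch only: enlarge the request header buffer so long literal.*
     URL parameters fit. Verify the connector class and property name
     against the jetty.xml actually shipped with your Solr. -->
<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <Set name="requestHeaderSize">65536</Set>
    </New>
  </Arg>
</Call>
```

This would only paper over the symptom, though; very large metadata values arguably should be truncated or sent another way.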
> >> >
> >> > Thanks,
> >> > Ahmet
> >> >
> >> > --- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:
> >> >
> >> >> From: Karl Wright <[email protected]>
> >> >> Subject: Re: Repeated service interruptions - failure processing document: null
> >> >> To: [email protected]
> >> >> Date: Monday, January 14, 2013, 5:46 PM
> >> >>
> >> >> I checked in a fix for this ticket on trunk. Please let me know if it
> >> >> resolves this issue.
> >> >>
> >> >> Karl
> >> >>
> >> >> On Mon, Jan 14, 2013 at 10:20 AM, Karl Wright <[email protected]> wrote:
> >> >> > This is because httpclient is retrying on error three times by
> >> >> > default. This has to be disabled in the Solr connector, or the rest
> >> >> > of the logic won't work right.
> >> >> >
> >> >> > I've opened a ticket (CONNECTORS-610) for this problem too.
> >> >> >
> >> >> > Karl
> >> >> >
> >> >> > On Mon, Jan 14, 2013 at 10:13 AM, Ahmet Arslan <[email protected]> wrote:
> >> >> >> Hi Karl,
> >> >> >>
> >> >> >> Thanks for the quick fix.
> >> >> >>
> >> >> >> I am still seeing the following error after 'svn up' and 'ant build':
> >> >> >>
> >> >> >> ERROR 2013-01-14 17:09:41,949 (Worker thread '6') - Exception tossed: Repeated service interruptions - failure processing document: null
> >> >> >> org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: null
> >> >> >>         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
> >> >> >> Caused by: org.apache.http.client.ClientProtocolException
> >> >> >>         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
> >> >> >>         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> >> >> >>         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> >> >> >>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
> >> >> >>         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >> >> >>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> >> >>         at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:790)
> >> >> >> Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
> >> >> >>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:692)
> >> >> >>         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:523)
> >> >> >>         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> >> >> >>         ... 6 more
> >> >> >> Caused by: java.net.SocketException: Broken pipe
> >> >> >>         at java.net.SocketOutputStream.socketWrite0(Native Method)
> >> >> >>         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> >> >> >>         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >> >> >>         at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:169)
> >> >> >>         at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110)
> >> >> >>         at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165)
> >> >> >>         at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92)
> >> >> >>         at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
> >> >> >>         at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
> >> >> >>         at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
> >> >> >>         at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
> >> >> >>         at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
> >> >> >>         at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
> >> >> >>         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
> >> >> >>         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:718)
> >> >> >>         ... 8 more
> >> >> >>
> >> >> >> --- On Mon, 1/14/13, Karl Wright <[email protected]> wrote:
> >> >> >>
> >> >> >>> From: Karl Wright <[email protected]>
> >> >> >>> Subject: Re: Repeated service interruptions - failure processing document: null
> >> >> >>> To: [email protected]
> >> >> >>> Date: Monday, January 14, 2013, 3:30 PM
> >> >> >>>
> >> >> >>> Hi Ahmet,
> >> >> >>>
> >> >> >>> The exception that seems to be causing the abort is a socket exception
> >> >> >>> coming from a socket write:
> >> >> >>>
> >> >> >>> > Caused by: java.net.SocketException: Broken pipe
> >> >> >>>
> >> >> >>> This makes sense in light of the http code returned from Solr, which
> >> >> >>> was 413: http://www.checkupdown.com/status/E413.html .
> >> >> >>>
> >> >> >>> So there is nothing actually *wrong* with the .aspx documents, but
> >> >> >>> they are just way too big, and Solr is rejecting them for that
> >> >> >>> reason.
> >> >> >>>
> >> >> >>> Clearly, though, the Solr connector should recognize this code as
> >> >> >>> meaning "never retry", so instead of killing the job, it should just
> >> >> >>> skip the document.  I'll open a ticket for that now.
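Karl's point above, that 413 should mean "never retry, just skip the document" rather than "kill the job", amounts to classifying HTTP status codes into dispositions. The sketch below illustrates that policy only; the class, enum, and method names are hypothetical and are not the actual ManifoldCF connector API.

```java
/**
 * Sketch of the policy described above: decide from an HTTP status code
 * whether an indexing attempt succeeded, should be retried later as a
 * transient service interruption, or should be skipped permanently.
 * Names are illustrative, not the real ManifoldCF API.
 */
class IngestDisposition {
  enum Action { SUCCESS, RETRY_LATER, SKIP_DOCUMENT }

  static Action classify(int httpStatus) {
    if (httpStatus >= 200 && httpStatus < 300)
      return Action.SUCCESS;
    // 413 Request Entity Too Large: the same document can only fail
    // again, so retrying is pointless -- skip it and move on.
    if (httpStatus == 413)
      return Action.SKIP_DOCUMENT;
    // Other errors (e.g. 5xx) are treated as transient interruptions
    // and retried later.
    return Action.RETRY_LATER;
  }
}
```

The key design point is that a permanent rejection must never be fed back into the retry machinery, or the job aborts with "Repeated service interruptions" exactly as seen in this thread.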
> >> >> >>>
> >> >> >>> Karl
> >> >> >>>
> >> >> >>> On Mon, Jan 14, 2013 at 8:22 AM, Ahmet Arslan <[email protected]> wrote:
> >> >> >>> > Hello,
> >> >> >>> >
> >> >> >>> > I am indexing a SharePoint 2010 instance using mcf-trunk (At revision 1432907)
> >> >> >>> >
> >> >> >>> > There is no problem with a Document library that contains word excel etc.
> >> >> >>> >
> >> >> >>> > However, I receive the following errors with a Document library that has *.aspx files in it.
> >> >> >>> >
> >> >> >>> > Status of Jobs => Error: Repeated service interruptions - failure processing document: null
> >> >> >>> >
> >> >> >>> >  WARN 2013-01-14 15:00:12,720 (Worker thread '13') - Service interruption reported for job 1358009105156 connection 'iknow': IO exception during indexing: null
> >> >> >>> > ERROR 2013-01-14 15:00:12,763 (Worker thread '13') - Exception tossed: Repeated service interruptions - failure processing document: null
> >> >> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: null
> >> >> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:585)
> >> >> >>> > Caused by: org.apache.http.client.ClientProtocolException
> >> >> >>> >         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:909)
> >> >> >>> >         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
> >> >> >>> >         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
> >> >> >>> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
> >> >> >>> >         at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
> >> >> >>> >         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
> >> >> >>> >         at org.apache.manifoldcf.agents.output.solr.HttpPoster$IngestThread.run(HttpPoster.java:768)
> >> >> >>> > Caused by: org.apache.http.client.NonRepeatableRequestException: Cannot retry request with a non-repeatable request entity.  The cause lists the reason the original request failed.
> >> >> >>> >         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:692)
> >> >> >>> >         at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:523)
> >> >> >>> >         at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
> >> >> >>> >         ... 6 more
> >> >> >>> > Caused by: java.net.SocketException: Broken pipe
> >> >> >>> >         at java.net.SocketOutputStream.socketWrite0(Native Method)
> >> >> >>> >         at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> >> >> >>> >         at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> >> >> >>> >         at org.apache.http.impl.io.AbstractSessionOutputBuffer.write(AbstractSessionOutputBuffer.java:169)
> >> >> >>> >         at org.apache.http.impl.io.ChunkedOutputStream.flushCacheWithAppend(ChunkedOutputStream.java:110)
> >> >> >>> >         at org.apache.http.impl.io.ChunkedOutputStream.write(ChunkedOutputStream.java:165)
> >> >> >>> >         at org.apache.http.entity.InputStreamEntity.writeTo(InputStreamEntity.java:92)
> >> >> >>> >         at org.apache.http.entity.HttpEntityWrapper.writeTo(HttpEntityWrapper.java:98)
> >> >> >>> >         at org.apache.http.impl.client.EntityEnclosingRequestWrapper$EntityWrapper.writeTo(EntityEnclosingRequestWrapper.java:108)
> >> >> >>> >         at org.apache.http.impl.entity.EntitySerializer.serialize(EntitySerializer.java:122)
> >> >> >>> >         at org.apache.http.impl.AbstractHttpClientConnection.sendRequestEntity(AbstractHttpClientConnection.java:271)
> >> >> >>> >         at org.apache.http.impl.conn.ManagedClientConnectionImpl.sendRequestEntity(ManagedClientConnectionImpl.java:197)
> >> >> >>> >         at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:257)
> >> >> >>> >         at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125)
> >> >> >>> >         at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:718)
> >> >> >>> >         ... 8 more
> >> >> >>> >
> >> >> >>> > Status of Jobs => Error: Unhandled Solr exception during indexing (0): Server at http://localhost:8983/solr/all returned non ok status:413, message:FULL head
> >> >> >>> >
> >> >> >>> > ERROR 2013-01-14 15:10:42,074 (Worker thread '15') - Exception tossed: Unhandled Solr exception during indexing (0): Server at http://localhost:8983/solr/all returned non ok status:413, message:FULL head
> >> >> >>> > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Unhandled Solr exception during indexing (0): Server at http://localhost:8983/solr/all returned non ok status:413, message:FULL head
> >> >> >>> >         at org.apache.manifoldcf.agents.output.solr.HttpPoster.handleSolrException(HttpPoster.java:360)
> >> >> >>> >         at org.apache.manifoldcf.agents.output.solr.HttpPoster.indexPost(HttpPoster.java:477)
> >> >> >>> >         at org.apache.manifoldcf.agents.output.solr.SolrConnector.addOrReplaceDocument(SolrConnector.java:594)
> >> >> >>> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.addOrReplaceDocument(IncrementalIngester.java:1579)
> >> >> >>> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.performIngestion(IncrementalIngester.java:504)
> >> >> >>> >         at org.apache.manifoldcf.agents.incrementalingest.IncrementalIngester.documentIngest(IncrementalIngester.java:370)
> >> >> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread$ProcessActivity.ingestDocument(WorkerThread.java:1652)
> >> >> >>> >         at org.apache.manifoldcf.crawler.connectors.sharepoint.SharePointRepository.processDocuments(SharePointRepository.java:1559)
> >> >> >>> >         at org.apache.manifoldcf.crawler.connectors.BaseRepositoryConnector.processDocuments(BaseRepositoryConnector.java:423)
> >> >> >>> >         at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:551)
> >> >> >>> >
> >> >> >>> > On the solr side I see :
> >> >> >>> >
> >> >> >>> > INFO: Creating new http client, config:maxConnections=200&maxConnectionsPerHost=8
> >> >> >>> > 2013-01-14 15:18:21.775:WARN:oejh.HttpParser:Full [671412972,-1,m=5,g=6144,p=6144,c=6144]={2F736F6C722F616 ...long long chars ... 2B656B6970{}
> >> >> >>> >
> >> >> >>> > Thanks,
> >> >> >>> > Ahmet
