Hi,
I am trying to ingest a large number of files. The metadata for these files
is stored in .met files.
Many of the metadata fields contain characters such as '<', '>', '&' and '$'.
Running the crawler on this metadata fails.
When I escape the characters using HTML encoding (e.g. '>' becomes '&gt;'),
I still get errors and the crawler cannot ingest the files.
Here is an example of the offending lines in the .met file, before and after
HTML encoding:
<val>sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex
--libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz) -2
<(gunzip -c
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz) -o
/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
-p 8 --no_bias_correct </val>
<val>sailfish quant --index /reference/v1/Homo-sapiens/GRCh37.p12/SailFishIndex
--libtype 'T=PE:O=&gt;&lt;:S=AS' -1 &lt;(gunzip -c
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R1.fastq.gz) -2
&lt;(gunzip -c
/gpfs/archive/RED/DA0000072/RNA-Seq/RawData/FastqFiles/HP1_3_R2.fastq.gz) -o
/gpfs/archive/RED/DA0000072/RNA-Seq/Processed/Sailfish-transcriptCounts/HP1_3.Sailfish.txt
-p 8 --no_bias_correct </val>
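To illustrate the encoding step, here is a minimal Python sketch of XML-escaping a value before it is written into a <val> element. It is not part of OODT: make_val_element is a hypothetical helper, and the command string is shortened for the example. It only checks that the escaped value is well-formed XML and round-trips to the original text; it does not show whether the file manager accepts it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def make_val_element(raw_value: str) -> str:
    """Hypothetical helper: wrap a metadata value in <val>, escaping &, < and >."""
    return "<val>" + escape(raw_value) + "</val>"


# Shortened, illustrative version of the command stored in the .met file.
raw = "--libtype 'T=PE:O=><:S=AS' -1 <(gunzip -c HP1_3_R1.fastq.gz)"
encoded = make_val_element(raw)

# Parsing the element back should recover the original string unchanged.
assert ET.fromstring(encoded).text == raw
print(encoded)
```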
If I remove the offending characters (in this case '<' and '>'), the ingestion
goes through without any issues.
The crawler command is:
./crawler_launcher --operation --launchAutoCrawler --productPath $FILEPATH
--filemgrUrl $OODT_FILEMGR_URL --clientTransferer
org.apache.oodt.cas.filemgr.datatransfer.InPlaceDataTransferFactory
--mimeExtractorRepo ../policy/mime-extractor-map.xml --noRecur --crawlForDirs
The error message I get when I run the crawler is:
INFO: StdIngester: ingesting product: ProductName: [A1_1.Sailfish.sfish]:
ProductType: [GenericFile]: FileLocation:
[/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/]
org.apache.xmlrpc.XmlRpcException: java.lang.Exception:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP
method failed: HTTP/1.1 400 Bad Request
at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeException(XmlRpcClientResponseProcessor.java:104)
at org.apache.xmlrpc.XmlRpcClientResponseProcessor.decodeResponse(XmlRpcClientResponseProcessor.java:71)
at org.apache.xmlrpc.XmlRpcClientWorker.execute(XmlRpcClientWorker.java:73)
at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:194)
at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:185)
at org.apache.xmlrpc.XmlRpcClient.execute(XmlRpcClient.java:178)
at org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient.ingestProduct(XmlRpcFileManagerClient.java:1178)
at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:199)
at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)
at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)
at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)
at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)
Oct 07, 2014 11:17:18 PM
org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient ingestProduct
SEVERE: Failed to ingest product
[org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP
method failed: HTTP/1.1 400 Bad Request -- rolling back ingest
java.lang.Exception: Failed to ingest product
[org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP
method failed: HTTP/1.1 400 Bad Request
at org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient.ingestProduct(XmlRpcFileManagerClient.java:1279)
at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:199)
at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)
at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)
at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)
at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)
Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.filemgr.ingest.StdIngester ingest
WARNING: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to
ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] :
java.lang.Exception:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP
method failed: HTTP/1.1 400 Bad Request
Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.crawl.ProductCrawler ingest
WARNING: ProductCrawler: Exception ingesting product:
[/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/A1_1.Sailfish.sfish]:
Message: exception ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to
ingest product [org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] :
java.lang.Exception:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP
method failed: HTTP/1.1 400 Bad Request: attempting to continue crawling
org.apache.oodt.cas.filemgr.structs.exceptions.IngestException: exception
ingesting product: [A1_1.Sailfish.sfish]: Message: Failed to ingest product
[org.apache.oodt.cas.filemgr.structs.Product@7c1ba98b] : java.lang.Exception:
org.apache.oodt.cas.filemgr.structs.exceptions.CatalogException: Error
ingesting product [org.apache.oodt.cas.filemgr.structs.Product@5d3f1d87] : HTTP
method failed: HTTP/1.1 400 Bad Request
at org.apache.oodt.cas.filemgr.ingest.StdIngester.ingest(StdIngester.java:204)
at org.apache.oodt.cas.crawl.ProductCrawler.ingest(ProductCrawler.java:304)
at org.apache.oodt.cas.crawl.ProductCrawler.handleFile(ProductCrawler.java:188)
at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:108)
at org.apache.oodt.cas.crawl.ProductCrawler.crawl(ProductCrawler.java:75)
at org.apache.oodt.cas.crawl.cli.action.CrawlerLauncherCliAction.execute(CrawlerLauncherCliAction.java:58)
at org.apache.oodt.cas.cli.CmdLineUtility.execute(CmdLineUtility.java:331)
at org.apache.oodt.cas.cli.CmdLineUtility.run(CmdLineUtility.java:187)
at org.apache.oodt.cas.crawl.CrawlerLauncher.main(CrawlerLauncher.java:36)
Oct 07, 2014 11:17:18 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile
WARNING: Failed to ingest product:
[/datavault/RNA-Seq/Processed/Sailfish-transcriptCounts/A1_1.Sailfish.sfish]:
performing postIngestFail actions
Any ideas how I can ingest these files?
Thanks
K