Hi Lucas,

I'm not sure whether this is a solution, but it may help. I did something 
similar while uploading from Query Console.
If you are open to using Query Console or an mlcp transform, below is the code 
I used to handle URIs even when they contain blank spaces.
I just used xdmp:url-encode($uri). We can create custom URIs as required. 
Ignore if not relevant.

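xquery version "1.0-ml";
import module namespace info = "http://marklogic.com/appservices/infostudio" at 
"/MarkLogic/appservices/infostudio/info.xqy";
declare namespace ts = "http://marklogic.com/MLU/top-songs";
(: namespace of the elements returned by xdmp:filesystem-directory :)
declare namespace dir = "http://marklogic.com/xdmp/directory";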
let $path := "D:\test\songs"

for $d in xdmp:filesystem-directory($path)//dir:entry
let $filepath := $d/dir:pathname/string()
let $doc := xdmp:document-get($d//dir:pathname)
let $title := $doc/ts:top-song/ts:title/string()
let $artist := $doc/ts:top-song/ts:artist/string()
let $genre := $doc//ts:top-song//ts:genres/ts:genre/string()
let $ref-uri := fn:concat("/songs/",$artist,"/",$title,".xml")
let $options :=
  <options xmlns="xdmp:document-load">
      <uri>{ xdmp:url-encode(xs:string($ref-uri)) }</uri>
      <repair>none</repair>
      <permissions>{xdmp:default-permissions()}</permissions>
      <collections>
        <collection>songs-xml</collection>
        {
          for $gen in $doc//ts:top-song//ts:genres/ts:genre
          return <collection>{$gen/string()}</collection>
        }
      </collections>
    </options>
let $database := "test"
let $genlen := fn:string-length(xdmp:url-encode(xs:string($ref-uri)))
return
xdmp:document-load($filepath,$options)

Thanks and Regards,
-Abhishek Jain

From: [email protected] 
[mailto:[email protected]] On Behalf Of Geert Josten
Sent: Thursday, March 23, 2017 2:39 PM
To: MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] URI_ID whitespace problems with mlcp

Sorry, ignore my reply, it only applies to delimited_text. Thanks to Martijn 
for pointing that out to me.

@Lucas, you did not mention XML parsing errors, so maybe your XML is just fine, 
and all you are trying to do is take an attribute value and use it as the URI. 
Unfortunately, you can't do that with -uri_id; it only takes XML element and 
JSON property names. Doing that would require an MLCP transform.
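
To tell the two cases apart before choosing an approach, a quick check along 
these lines may help. It is only a sketch: the function name classify_row_id is 
made up for illustration, and it assumes the record element is called row (as 
in -aggregate_record_element row) and that _id, if it is an attribute, is the 
first one on the element.

```shell
# Sketch: report how "row _id" appears in an input file.
# Assumptions: the record element is "row"; "_id" directly follows the name.
classify_row_id() {
  if grep -q '<row _id=' "$1"; then
    # <row _id="..."> is a well-formed element with an _id attribute
    echo attribute
  elif grep -q '<row _id>' "$1"; then
    # <row _id> is not well-formed XML (space inside the tag name)
    echo malformed-tag
  else
    echo not-found
  fi
}
```

It could be run against the file from the original command, for example 
classify_row_id ../xml/MD2014aggregate.xml. An answer of "attribute" would 
point towards the transform route rather than -uri_id.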

Kind regards,
Geert

From: 
<[email protected]<mailto:[email protected]>>
 on behalf of Geert Josten 
<[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, March 22, 2017 at 8:18 PM
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] URI_ID whitespace problems with mlcp

Valid points all, but MLCP warns about spaces in header names, and proceeds by 
converting them to underscores before generating XML out of them.

On the other hand, though neither likely nor practical, spaces in property 
names are allowed in JSON. ;-)

Cheers,
Geert

From: 
<[email protected]<mailto:[email protected]>>
 on behalf of Florent Georges <[email protected]<mailto:[email protected]>>
Reply-To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, March 22, 2017 at 3:01 PM
To: MarkLogic Developer Discussion 
<[email protected]<mailto:[email protected]>>
Subject: Re: [MarkLogic Dev General] URI_ID whitespace problems with mlcp

Hi,

That is indeed the most likely explanation.  Just to make it clear to the OP, 
in such a situation an XML parser MUST stop normal processing (see e.g. 
http://w3.org/TR/xml/#sec-terminology, and the fact that having "<a b>" where a 
start tag is possible is ultimately breaking the document production rule).
When it comes to XML (in general, not only with MarkLogic), sometimes working 
around validity might be the right solution, depending on the technical and 
non-technical context.  But having ill-formed documents never is.  Fixing 
ill-formedness is always less painful than any other solution.
Just my 2 cents.  Regards,

--
Florent Georges
H2O Consulting
http://h2o.consulting/

On 22 March 2017 at 14:14, Martijn Sintemaartensdijk wrote:
Dear Lucas,

judging from your command, I think your input file contains an XML start tag 
"<row _id>" and a corresponding end tag "</row _id>". Unfortunately, XML tag 
names may not contain spaces (see also: 
https://www.w3.org/TR/2008/REC-xml-20081126/#NT-Name).

MLCP tries to parse the XML file and reports an unexpected character, ">". It 
assumes "_id" is an attribute name on the element "row", as in 
<row _id="1234">. The next character following "_id" is therefore expected to 
be an equals sign.

I would advise you to request that the output file be delivered in accordance 
with the XML specification, rather than trying to fix the document yourself. 
Otherwise, I fear, you will be forced to use sed, or something similar, to 
replace the malformed XML tags throughout the entire document each and every 
time you receive a new version.
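
If the file cannot be fixed at the source, the sed fallback could look roughly 
like the sketch below. It assumes the tags appear literally as <row _id> and 
</row _id>; the function name fix_row_id_tags and the replacement name row_id 
are hypothetical choices, not something given in this thread.

```shell
# Sketch: rewrite the malformed tag names to a legal XML name.
# Assumption: tags are exactly <row _id> and </row _id>, with no attributes.
fix_row_id_tags() {
  sed -e 's|<row _id>|<row_id>|g' -e 's|</row _id>|</row_id>|g'
}

# Example: filter a one-line sample through the function.
printf '<row _id>42</row _id>\n' | fix_row_id_tags   # prints <row_id>42</row_id>
```

The cleaned file would then be imported with the renamed element as the 
-uri_id value. This has to be repeated for every new delivery, which is the 
drawback mentioned above.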


Met vriendelijke groet / Kind regards,



Martijn Sintemaartensdijk



A: Einsteinbaan 12, 3439 NJ Nieuwegein

T: (+31) 06 40 59 09 36

E: [email protected]<mailto:[email protected]>

W: www.dikw.nl<http://www.dikw.nl/>




On 21 March 2017 at 19:02, Lucas Davenport 
<[email protected]<mailto:[email protected]>> wrote:
I am a newb, so forgive me if I missed this answer while searching.

I am testing ML 8 for a project at work and we have a requirement to load large 
amounts of historical data. I've read the mlcp documentation and can 
successfully import some test data, but the problem I am facing is the archive 
data has a space in the record identifier.

My command is:
 mlcp.sh import -host localhost -port 8006 -username dataload -password 
dataload -mode local -input_file_path ../xml/MD2014aggregate.xml 
-input_file_type aggregates -aggregate_record_element row -uri_id "row _id" 
-output_uri_prefix /traffic/MD -output_uri_suffix .xml -output_collections 
published

This produces the following error:
17/03/21 13:49:20 ERROR contentpump.ContentPump: Unrecognized argument: \_id

I've escaped both the space and the underscore (row\ _id and row\ \_id) and 
still get the same error. I've also wrapped it in single quotes and double 
quotes.
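
The error message suggests that the value reached ContentPump as two separate 
arguments despite the quoting. One possible cause, not confirmed anywhere in 
this thread, is a wrapper script that forwards its arguments unquoted. The 
sketch below only illustrates that general POSIX shell behaviour with 
hypothetical functions; it says nothing about how mlcp.sh is actually written.

```shell
# Sketch: how unquoted forwarding re-splits an argument containing a space.
# All three function names are hypothetical and exist only for this demo.
count_args() { echo "$#"; }

# Unquoted $* is split again on whitespace, so "row _id" becomes two words.
forward_unquoted() { count_args $*; }

# "$@" passes each original argument through unchanged.
forward_quoted() { count_args "$@"; }

forward_unquoted -uri_id "row _id"   # prints 3
forward_quoted -uri_id "row _id"     # prints 2
```

If something like this is happening, no amount of quoting on the command line 
would help, because the split occurs after the shell has already removed the 
quotes.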

I'm trying to keep from having to use sed to remove the space between row and 
_id in the entire file.

Is there a way to make mlcp see the URI_ID literally as "row _id"?

Thanks in advance.

_______________________________________________
General mailing list
[email protected]<mailto:[email protected]>
Manage your subscription at:
http://developer.marklogic.com/mailman/listinfo/general

