Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Josh Nathanson
OK, I added this to my regex:

\x00

Which is a hex representation of the character 0.  And it worked.

Not sure why chr(0) didn't work.
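
For anyone reading this outside CF, here is a minimal Java sketch of the same idea. It assumes java.util.regex, which is not necessarily the engine behind CF's REReplace, so treat it as an illustration rather than a drop-in equivalent. The class and method names are made up for the example.

```java
final class NullStripper {
    // Remove every NUL (U+0000) character.
    // In the regex, \x00 is the hex escape for character code 0.
    static String stripNul(String input) {
        return input.replaceAll("\\x00", "");
    }

    public static void main(String[] args) {
        String dirty = "123 Main\u0000 St";
        System.out.println(stripNul(dirty)); // prints "123 Main St"
    }
}
```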

Yes it's non scalable...but, since the data is not going into the database 
as xml, just plain old form fields, I can't use CDATA on the way in anyway, 
correct?  I would have to run the same regex on each of the incoming form 
fields that are text...so, this way is more scalable than that I guess.

-- Josh




- Original Message - 
From: Rob Wilkerson [EMAIL PROTECTED]
To: CF-Talk cf-talk@houseoffusion.com
Sent: Tuesday, November 07, 2006 10:19 AM
Subject: Re: Cleaning XML - Unicode 0x0


 On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:
  Thanks for your help Rob.  I just don't know which field is the culprit as
  far as the null character (there's no description field or anything obvious
  like that), and I'm hesitant to CDATA every single field that's going into
  the db, unless I've exhausted every possible other option.

 I wouldn't apply a CDATA block to every field indiscriminately, but I
 would apply it to varchar and text fields where the data is likely to
 be quite variable.

  I'll keep grinding on trying to regex the null character out of there and
  let the list know if I figure anything out.

 The problem with this approach is that while it's currently the null
 character, next time it might be something else and then something
 else.  Your regex could just continue to grow.  I guess what I'm
 saying is that it's not really a scalable solution.

 Handling invalid characters in a batch manner by including them in a
 CDATA block or by understanding how those characters are being
 inserted is a more workable long term solution.

 That said, adding this final character may turn out to be the last you
 ever hear of this particular problem.  :-)

 

~|
Introducing the Fusion Authority Quarterly Update. 80 pages of hard-hitting,
up-to-date ColdFusion information by your peers, delivered to your door four 
times a year.
http://www.fusionauthority.com/quarterly

Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259480
Subscription: http://www.houseoffusion.com/groups/CF-Talk/subscribe.cfm
Unsubscribe: 
http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Rob Wilkerson
On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:

 Yes it's non scalable...but, since the data is not going into the database
 as xml, just plain old form fields, I can't use CDATA on the way in anyway,
 correct?  I would have to run the same regex on each of the incoming form
 fields that are text...so, this way is more scalable than that I guess.

Maybe I'm not clear about how you're using XML.  Are you extracting
data from your DB into an XML format or doing something else?  I
assumed you were formatting data from the DB into XML for the purpose
of delivering it somewhere else.  If that's the case, the CDATA -
while it may or may not have solved this particular problem - is safer
than what you're doing and scalable.

Of course, I'm just beating a dead horse if you've already found
something you're happy with.  :-)


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259500


RE: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Matt Quackenbush
Josh,

I think the point that Rob and others were making is that your data should
be validated and cleaned up BEFORE being inserted into the database -
whether it's inserted as XML or not is completely and utterly irrelevant.
If you didn't have invalid data in the database, then you wouldn't have
invalid data in your XML.  But, since the data obviously is NOT being
validated and cleaned up before db entry, the best, most scalable, and most
widely accepted good practice would be to use CDATA in your XML.

Again though, what you're doing is just a bandaid that covers up the real
issue, which is invalid data being entered into the database.


Thanks,

Matt



Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259505


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Rob Wilkerson
On 11/7/06, Matt Quackenbush [EMAIL PROTECTED] wrote:
 Josh,

 I think the point that Rob and others were making is that your data should
 be validated and cleaned up BEFORE being inserted into the database -
 whether it's inserted as XML or not is completely and utterly irrelevant.

That's not exactly what I was saying, but I do agree that it's a good
practice when possible.  On the whole, though, I'm a proponent of less
rather than more restriction on what can be entered.  Some data is
restrictive by its very nature (e.g. price, quantity, etc.), but other
data is very unstructured (e.g. name, title, description).  In the
latter case, I prefer to try to keep it as it was entered and then
handle it when it's used - preferably without modification.

 If you didn't have invalid data in the database, then you wouldn't have
 invalid data in your XML.  But, since the data obviously is NOT being
 validated and cleaned up before db entry, the best, most scalable, and most
 widely accepted good practice would be to use CDATA in your XML.

Exactly.  Any number of characters can creep into that unstructured
text I mentioned above.  A LOT of them if the text is copied and
pasted from MSWord.  Those characters can either be stripped
one-by-one using REReplace() or another similar method or you can
simply allow them in your XML by enclosing them in a CDATA block.  The
latter is much easier and retains the data exactly as it was entered.

 Again though, what you're doing is just a bandaid that covers up the real
 issue, which is invalid data being entered into the database.

In Josh's case, I don't think I have a good sense of what kind of data
he's got nor of the process in which he's using it so the best I could
do was throw out generic options.  Hopefully they made enough sense
that he'll be able to use them if he feels the need to do so.


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259535


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Josh Nathanson
Rob, you had a good idea of my situation.  It's basically name, address etc. 
that shoppers enter when they are buying our product.  Then after the data 
is in the db, I output it as xml for a third party application.  Somehow, 
control characters sneak into that form data from time to time.

I do understand that it's best practice to clean the data before it goes 
into the db.  However in my particular case, it doesn't make a whole bunch 
of difference whether I clean it before or after.

Maybe I can summarize:
1) CDATA is not helpful when encountering control characters.
2) Thus, I have to use rereplace with all the known control characters that 
have broken the xml in the past (CF tells you which character is the 
problem, in Unicode)
3) If I did the rereplace on the way into the db, it still may not catch all 
offending control characters.  There may be a new one that isn't in the 
regex yet.  Additionally, I don't want to disrupt the shopper's checkout 
process if at all possible.
4) Thus, the data with the new control character would still go into the db 
and break the xml on the way out of the db.
5) So, I may as well just do it on the way out of the db, where I don't have 
to worry about disrupting a shopper when they are about to buy something 
(can you imagine the error message: Sorry, we have detected an invisible 
character in your address.  Please remove it and re-submit.)
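
Point 1 can be checked outside CF. This is a rough sketch, assuming the XML parser bundled with the JDK stands in for whatever the third party uses; the class and method names are invented for the example. It parses a CDATA section that contains ordinary markup characters, then one that contains a null character.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class CdataCheck {
    // Returns true if the JDK's XML parser accepts the string as well-formed.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the document
        } catch (Exception e) {
            throw new IllegalStateException(e); // setup or I/O problem
        }
    }

    public static void main(String[] args) {
        // CDATA protects markup characters such as & and <
        System.out.println(isWellFormed("<a><![CDATA[Tom & <Jerry>]]></a>"));
        // ...but a null character is still not a legal XML character
        System.out.println(isWellFormed("<a><![CDATA[Tom\u0000Jerry]]></a>"));
    }
}
```

So CDATA covers characters that look like markup, while characters that XML 1.0 does not allow at all still have to be removed before the document is built.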

-- Josh




Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259538


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Rob Wilkerson
On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:

 Maybe I can summarize:
 1) CDATA is not helpful when encountering control characters.

True.  Does lead you to wonder, though, how they're sneaking in there.
Folks don't just type in null characters...

 2) Thus, I have to use rereplace with all the known control characters that
 have broken the xml in the past (CF tells you which character that is the
 problem, in Unicode)

I still think there's a better way.  This is painting with a pretty
broad brush, but this regex will simply remove all characters that are
not in the ASCII range:

REReplace ( mystring, '[^\x00-\x7f]', '', 'ALL' )

Again, it's a pretty broad brush, but it shouldn't be too hard to
narrow the focus to all non-printing characters.  And it's at least a
little more scalable.
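
To make the "narrow the focus" part concrete, here is a hedged Java sketch (java.util.regex, not CF's engine; class and method names are hypothetical). The first method is the pattern above as written: the leading caret negates the class, so it removes everything outside 0x00-0x7F and leaves a null character in place. The second is one possible narrowing, which assumes tab, line feed and carriage return should be kept.

```java
final class AsciiVersusControl {
    // The pattern as posted: [^...] is a negated class, so this strips
    // non-ASCII characters and keeps everything in 0x00-0x7F, NUL included.
    static String stripNonAscii(String s) {
        return s.replaceAll("[^\\x00-\\x7f]", "");
    }

    // Narrowed variant (an assumption about what to target): strip ASCII
    // control characters and DEL, but keep tab (0x09), LF (0x0A), CR (0x0D).
    static String stripAsciiControls(String s) {
        return s.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "");
    }

    public static void main(String[] args) {
        String input = "caf\u00e9\u0000 au lait";
        System.out.println(stripNonAscii(input).length());      // accent gone, NUL kept
        System.out.println(stripAsciiControls(input).length()); // NUL gone, accent kept
    }
}
```

The second form also leaves accented names and addresses alone, which the broad-brush version would not.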

 3) If I did the rereplace on the way into the db, it still may not catch all
 offending control characters.  There may be a new one that isn't in the
 regex yet.  Additionally, I don't want to disrupt the shopper's checkout
 process if at all possible.

The regex above may help prevent you from having to add more on a one-off basis.


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259557


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Josh Nathanson
Hey Rob,

 True.  Does lead you to wonder, though, how they're sneaking in there.
 Folks don't just type in null characters...

It leads me to wonder all right!!  Maybe forms autofill or something?

 REReplace ( mystring, '[^\x00-\x7f]', '', 'ALL' )
 Again, it's a pretty broad brush, but it shouldn't be too hard to
 narrow the focus to all non-printing characters.  And it's at least a
 little more scalable.

Cool thanks, I'll try that regex.  Do I need the caret at the beginning of 
the range though?  I'm looking to match and replace the non-printing 
characters, not everything except the non-printing characters...or am I 
misunderstanding that?

-- Josh





 


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259560


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Paul Hastings
Josh Nathanson wrote:
 3) If I did the rereplace on the way into the db, it still may not catch all 
 offending control characters.  There may be a new one that isn't in the 
 regex yet.  Additionally, I don't want to disrupt the shopper's checkout 
 process if at all possible.

there can't be. the control chars are a fixed, known quantity. not a cf regex 
expert but w/a java Pattern you can pull out \p{Cntrl} for US-ASCII or better yet 
use the unicode control char category (Cc) which goes a bit deeper. the 
non-whitespace ones are:

'\u0000' through '\u0008'
'\u000E' through '\u001B'
'\u007F' through '\u009F'

you could also test whether the char is ignorable. have a look at the java 
docs for Character & Character.UnicodeBlock.
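
a rough java sketch of both ideas, in case it helps. class and method names are invented for the example, and whether cf's own regex functions accept \p{Cc} is not something this shows.

```java
final class UnicodeControls {
    // \p{Cc} is the Unicode "control" general category:
    // U+0000-U+001F and U+007F-U+009F, whitespace controls included.
    static String stripAllControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Character.isIdentifierIgnorable is true for the non-whitespace ISO
    // control ranges and also for format (Cf) characters, so tab and
    // newlines survive this version.
    static String stripIgnorable(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
         .filter(cp -> !Character.isIdentifierIgnorable(cp))
         .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "12 Elm\u0000 St\tApt 4";
        System.out.println(stripAllControls(input)); // NUL and tab both removed
        System.out.println(stripIgnorable(input));   // NUL removed, tab kept
    }
}
```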

 5) So, I may as well just do it on the way out of the db, where I don't have 
 to worry about disrupting a shopper when they are about to buy something 
 (can you imagine the error message: Sorry, we have detected an invisible 
 character in your address.  Please remove it and re-submit.)

why do it that way? why not just remove the chars silently?


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259563


Re: Cleaning XML - Unicode 0x0 SOLVED sorta

2006-11-07 Thread Rob Wilkerson
On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:
 Hey Rob,

  True.  Does lead you to wonder, though, how they're sneaking in there.
  Folks don't just type in null characters...

 It leads me to wonder allright!!  Maybe forms autofill or something?

I wish I had an answer for you.  It's gotta be something, but I've
never run into the problem.  Do the control characters exist within
the database or is there a chance that they're being added when
retrieved?

  REReplace ( mystring, '[^\x00-\x7f]', '', 'ALL' )
  Again, it's a pretty broad brush, but it shouldn't be too hard to
  narrow the focus to all non-printing characters.  And it's at least a
  little more scalable.

 Cool thanks, I'll try that regex.  Do I need the caret at the beginning of
 the range though?  I'm looking to match and replace the non-printing
 characters, not everything except the non-printing characters...or am I
 misunderstanding that?

That is a hex representation of the ascii range.  The caret negates it,
so as written the regex strips everything outside that range.  I used
that regex for something I was working on a while back.  You'd need to
tweak it for control/non-printing characters.  I don't know the values
off the top of my head, but a good ascii table can be found at
http://www.subterrane.com/files/asciitable.


Archive: 
http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259565