Re: Cleaning XML - Unicode 0x0 SOLVED sorta
OK, I added this to my regex:

\x00

which is a hex representation of the character 0. And it worked. Not sure why chr(0) didn't work.

Yes, it's not scalable... but since the data is not going into the database as XML, just as plain old form fields, I can't use CDATA on the way in anyway, correct? I would have to run the same regex on each of the incoming form fields that are text... so this way is more scalable than that, I guess.

-- Josh

- Original Message -
From: Rob Wilkerson [EMAIL PROTECTED]
To: CF-Talk cf-talk@houseoffusion.com
Sent: Tuesday, November 07, 2006 10:19 AM
Subject: Re: Cleaning XML - Unicode 0x0

On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:
> Thanks for your help Rob. I just don't know which field is the culprit as far as the null character (there's no description field or anything obvious like that), and I'm hesitant to CDATA every single field that's going into the db, unless I've exhausted every possible other option.

I wouldn't apply a CDATA block to every field indiscriminately, but I would apply it to varchar and text fields where the data is likely to be quite variable.

> I'll keep grinding on trying to regex the null character out of there and let the list know if I figure anything out.

The problem with this approach is that while it's currently the null character, next time it might be something else and then something else. Your regex could just continue to grow. I guess what I'm saying is that it's not really a scalable solution. Handling invalid characters in a batch manner by including them in a CDATA block or by understanding how those characters are being inserted is a more workable long-term solution.

That said, adding this final character may turn out to be the last you ever hear of this particular problem. :-)

~| Introducing the Fusion Authority Quarterly Update. 80 pages of hard-hitting, up-to-date ColdFusion information by your peers, delivered to your door four times a year.
http://www.fusionauthority.com/quarterly Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259480 Subscription: http://www.houseoffusion.com/groups/CF-Talk/subscribe.cfm Unsubscribe: http://www.houseoffusion.com/cf_lists/unsubscribe.cfm?user=11502.10531.4
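For readers following along outside ColdFusion, here is a minimal sketch of the same fix in Java, the platform ColdFusion runs on. It is not the code used in the thread: `NulStripper` and `stripNul` are hypothetical names, and the sample address is made up. It only shows that the regex escape `\x00` matches the character with code 0; it does not explain why chr(0) failed in the original regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not code from the thread: removes U+0000 from a string.
final class NulStripper {
    // In a Java string literal the regex \x00 is written "\\x00".
    // It is the hex escape for the character with code 0.
    private static final Pattern NUL = Pattern.compile("\\x00");

    static String stripNul(String input) {
        return NUL.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up form value with a trailing null character.
        String dirty = "123 Main St\u0000";
        String clean = stripNul(dirty);
        System.out.println(dirty.length() + " -> " + clean.length());
    }
}
```

The same helper would have to be applied to every text field on the way out, which is the maintenance cost Rob is pointing at.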
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:
> Yes it's non scalable...but, since the data is not going into the database as xml, just plain old form fields, I can't use CDATA on the way in anyway, correct? I would have to run the same regex on each of the incoming form fields that are text...so, this way is more scalable than that I guess.

Maybe I'm not clear about how you're using XML. Are you extracting data from your DB into an XML format or doing something else? I assumed you were formatting data from the DB into XML for the purpose of delivering it somewhere else. If that's the case, CDATA, while it may or may not have solved this particular problem, is safer than what you're doing and is scalable.

Of course, I'm just beating a dead horse if you've already found something you're happy with. :-)

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259500
RE: Cleaning XML - Unicode 0x0 SOLVED sorta
Josh,

I think the point that Rob and others were making is that your data should be validated and cleaned up BEFORE being inserted into the database - whether it's inserted as XML or not is completely and utterly irrelevant. If you didn't have invalid data in the database, then you wouldn't have invalid data in your XML. But, since the data obviously is NOT being validated and cleaned up before db entry, the best, most scalable, and most widely accepted good practice would be to use CDATA in your XML.

Again though, what you're doing is just a bandaid that covers up the real issue, which is invalid data being entered into the database.

Thanks,
Matt

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259505
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
On 11/7/06, Matt Quackenbush [EMAIL PROTECTED] wrote:
> Josh, I think the point that Rob and others were making is that your data should be validated and cleaned up BEFORE being inserted into the database - whether it's inserted as XML or not is completely and utterly irrelevant.

That's not exactly what I was saying, but I do agree that it's a good practice when possible. On the whole, though, I'm a proponent of less rather than more restriction on what can be entered. Some data is restrictive by its very nature (e.g. price, quantity, etc.), but other data is very unstructured (e.g. name, title, description). In the latter case, I prefer to try to keep it as it was entered and then handle it when it's used - preferably without modification.

> If you didn't have invalid data in the database, then you wouldn't have invalid data in your XML. But, since the data obviously is NOT being validated and cleaned up before db entry, the best, most scalable, and most widely accepted good practice would be to use CDATA in your XML.

Exactly. Any number of characters can creep into that unstructured text I mentioned above. A LOT of them if the text is copied and pasted from MSWord. Those characters can either be stripped one-by-one using REReplace() or another similar method, or you can simply allow them in your XML by enclosing them in a CDATA block. The latter is much easier and retains the data exactly as it was entered.

> Again though, what you're doing is just a bandaid that covers up the real issue, which is invalid data being entered into the database.

In Josh's case, I don't think I have a good sense of what kind of data he's got nor of the process in which he's using it, so the best I could do was throw out generic options. Hopefully they made enough sense that he'll be able to use them if he feels the need to do so.

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259535
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
Rob, you had a good idea of my situation. It's basically name, address etc. that shoppers enter when they are buying our product. Then after the data is in the db, I output it as xml for a third party application. Somehow, control characters sneak into that form data from time to time.

I do understand that it's best practice to clean the data before it goes into the db. However, in my particular case, it doesn't make a whole bunch of difference whether I clean it before or after. Maybe I can summarize:

1) CDATA is not helpful when encountering control characters.

2) Thus, I have to use rereplace with all the known control characters that have broken the xml in the past (CF tells you which character is the problem, in Unicode).

3) If I did the rereplace on the way into the db, it still may not catch all offending control characters. There may be a new one that isn't in the regex yet. Additionally, I don't want to disrupt the shopper's checkout process if at all possible.

4) Thus, the data with the new control character would still go into the db and break the xml on the way out of the db.

5) So, I may as well just do it on the way out of the db, where I don't have to worry about disrupting a shopper when they are about to buy something (can you imagine the error message: "Sorry, we have detected an invisible character in your address. Please remove it and re-submit.")

-- Josh

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259538
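Point 1 in the summary can be checked directly. The sketch below is an illustration, not code from the thread: it wraps a value in a CDATA section inside a made-up `<field>` element and asks the JDK's built-in XML parser whether the result is well-formed. `CdataCheck` and `parsesInsideCdata` are hypothetical names, and the sample values are invented. The expectation rests on the XML 1.0 rule that U+0000 is not a legal character anywhere in a document, CDATA included.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo, not code from the thread.
final class CdataCheck {
    // Wraps the value in CDATA inside a made-up <field> element and reports
    // whether the JDK's default XML parser accepts the document.
    static boolean parsesInsideCdata(String value) {
        String xml = "<field><![CDATA[" + value + "]]></field>";
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Malformed XML, or a parser setup / I/O failure.
            return false;
        }
    }

    public static void main(String[] args) {
        // Markup characters are fine inside CDATA.
        System.out.println(parsesInsideCdata("Smith & Sons <Ltd>"));
        // A null character is rejected even inside CDATA.
        System.out.println(parsesInsideCdata("Smith\u0000 & Sons"));
    }
}
```

So CDATA takes care of characters like `&` and `<`, but the control characters still have to be removed before the XML is written. The sketch ignores values that themselves contain `]]>`, which would need separate handling.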
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:
> Maybe I can summarize: 1) CDATA is not helpful when encountering control characters.

True. Does lead you to wonder, though, how they're sneaking in there. Folks don't just type in null characters...

> 2) Thus, I have to use rereplace with all the known control characters that have broken the xml in the past (CF tells you which character that is the problem, in Unicode)

I still think there's a better way. This is painting with a pretty broad brush, but this regex will simply remove all characters that are not in the ASCII range:

    REReplace ( mystring, '[^\x00-\x7f]', '', 'ALL' )

Again, it's a pretty broad brush, but it shouldn't be too hard to narrow the focus to all non-printing characters. And it's at least a little more scalable.

> 3) If I did the rereplace on the way into the db, it still may not catch all offending control characters. There may be a new one that isn't in the regex yet. Additionally, I don't want to disrupt the shopper's checkout process if at all possible.

The regex above may help prevent you from having to add more on a one-off basis.

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259557
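As a rough illustration of the "narrow the focus" step, here is a Java sketch that puts the posted pattern next to one possible narrowing. The narrowed character class is an assumption, not something from the thread: it targets the control characters below U+0020 that XML 1.0 forbids and leaves tab, line feed and carriage return alone. `AsciiVersusControl` and its method names are hypothetical. One thing worth noticing is that `\x00` lies inside the ASCII range, so the pattern as posted leaves the null character in place.

```java
import java.util.regex.Pattern;

// Hypothetical comparison, not code from the thread.
final class AsciiVersusControl {
    // The pattern from the message: everything OUTSIDE 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7f]");

    // Assumed narrowing: control characters below 0x20 that XML 1.0 forbids.
    // Tab (0x09), line feed (0x0A) and carriage return (0x0D) are kept.
    private static final Pattern XML_INVALID_CONTROLS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    static String stripNonAscii(String s) {
        return NON_ASCII.matcher(s).replaceAll("");
    }

    static String stripControls(String s) {
        return XML_INVALID_CONTROLS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up name with an accented letter and a trailing null character.
        String name = "Jos\u00e9\u0000";
        System.out.println(stripNonAscii(name).length()); // accent gone, null kept
        System.out.println(stripControls(name).length()); // null gone, accent kept
    }
}
```

The narrowed class also keeps accented names intact, which matters for the name and address data described earlier in the thread.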
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
Hey Rob,

> True. Does lead you to wonder, though, how they're sneaking in there. Folks don't just type in null characters...

It leads me to wonder all right!! Maybe forms autofill or something?

> REReplace ( mystring, '[^\x00-\x7f]', '', 'ALL' )
> Again, it's a pretty broad brush, but it shouldn't be too hard to narrow the focus to all non-printing characters. And it's at least a little more scalable.

Cool thanks, I'll try that regex. Do I need the caret at the beginning of the range though? I'm looking to match and replace the non-printing characters, not everything except the non-printing characters... or am I misunderstanding that?

-- Josh

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259560
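On the caret question, a small Java sketch may help; it is an illustration only, and `CaretDemo` and its method names are hypothetical. A caret at the start of a character class negates it. In the posted pattern the range `\x00-\x7f` names the characters to keep, so the caret is what makes the regex match everything else. A class that lists the characters to remove would be written without it.

```java
// Hypothetical demo of how a leading caret flips a character class.
final class CaretDemo {
    // With the caret: matches (and removes) everything OUTSIDE 0x00-0x7F.
    static String removeOutsideRange(String s) {
        return s.replaceAll("[^\\x00-\\x7f]", "");
    }

    // Without the caret: matches (and removes) everything INSIDE 0x00-0x7F.
    static String removeInsideRange(String s) {
        return s.replaceAll("[\\x00-\\x7f]", "");
    }

    public static void main(String[] args) {
        // Made-up input: ASCII text with one non-ASCII letter (e with diaeresis).
        String input = "Zo\u00eb 42";
        System.out.println(removeOutsideRange(input)); // ASCII text survives
        System.out.println(removeInsideRange(input));  // only the non-ASCII letter survives
    }
}
```

Dropping the caret from the posted pattern would therefore delete nearly all of an ordinary English-language field.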
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
Josh Nathanson wrote:
> 3) If I did the rereplace on the way into the db, it still may not catch all offending control characters. There may be a new one that isn't in the regex yet. Additionally, I don't want to disrupt the shopper's checkout process if at all possible.

there can't be. the control chars are a fixed, known quantity. not a cf regex expert but w/a java Pattern you can pull out \p{Cntrl} for US-ASCII or better yet use the unicode control char block (Cc) which goes a bit deeper:

'\u0000' through '\u0008'
'\u000E' through '\u001B'
'\u007F' through '\u009F'

you could also test whether the char is ignorable. have a look at the java docs for Character and Character.UnicodeBlock.

> 5) So, I may as well just do it on the way out of the db, where I don't have to worry about disrupting a shopper when they are about to buy something (can you imagine the error message: Sorry, we have detected an invisible character in your address. Please remove it and re-submit.)

why do it that way? why not just remove the chars silently?

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259563
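A minimal Java sketch of the "remove them silently" idea, using the Cc category mentioned above. It is an assumption-laden illustration rather than anything posted in the thread: `ControlCategory` and `stripControls` are hypothetical names, and keeping tab, line feed and carriage return is a choice made here because XML allows them. For reference, the general category Cc covers U+0000 to U+001F and U+007F to U+009F, while the three ranges listed in the message are the ones the Java documentation gives for `Character.isIdentifierIgnorable`.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not code from the thread.
final class ControlCategory {
    // \p{Cc} is the Unicode general category "Control".
    // The && intersection subtracts tab, line feed and carriage return,
    // so those three are kept.
    private static final Pattern CONTROLS =
            Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    static String stripControls(String s) {
        return CONTROLS.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up value containing a null, a tab, DEL and a C1 control.
        String dirty = "a\u0000b\tc\u007Fd\u0090e";
        System.out.println(stripControls(dirty).length());
        // The per-character tests mentioned in the message:
        System.out.println(Character.isISOControl('\u0000'));
        System.out.println(Character.isIdentifierIgnorable('\t'));
    }
}
```

Because the category is fixed, the pattern does not need to grow each time a new control character turns up. It does also remove the C1 range (U+0080 to U+009F), which is broader than what XML 1.0 strictly forbids.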
Re: Cleaning XML - Unicode 0x0 SOLVED sorta
On 11/7/06, Josh Nathanson [EMAIL PROTECTED] wrote:
> Hey Rob,
> True. Does lead you to wonder, though, how they're sneaking in there. Folks don't just type in null characters...
> It leads me to wonder allright!! Maybe forms autofill or something?

I wish I had an answer for you. It's gotta be something, but I've never run into the problem. Do the control characters exist within the database or is there a chance that they're being added when retrieved?

> REReplace ( mystring, '[^\x00-\x7f]', '', 'ALL' )
> Again, it's a pretty broad brush, but it shouldn't be too hard to narrow the focus to all non-printing characters. And it's at least a little more scalable.
> Cool thanks, I'll try that regex. Do I need the caret at the beginning of the range though? I'm looking to match and replace the non-printing characters, not everything except the non-printing characters...or am I misunderstanding that?

That is a hex representation of the ascii range. I used that regex for something I was working on a while back. You'd need to tweak it out for control/non-printing characters. I don't know the values off the top of my head, but a good ascii table can be found at http://www.subterrane.com/files/asciitable.

Archive: http://www.houseoffusion.com/groups/CF-Talk/message.cfm/messageid:259565
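One way to approach the question of where the characters live is to scan the stored values and report every code point that XML 1.0 does not allow. The sketch below is a hypothetical diagnostic, not something from the thread: `ControlCharReport`, `findInvalid` and the field names are invented, and the values would in practice come from a database query rather than a hand-built map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical diagnostic, not code from the thread.
final class ControlCharReport {
    // True for code points allowed by the XML 1.0 "Char" production.
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns one entry per offending code point: field name, index, code point.
    static List<String> findInvalid(Map<String, String> fields) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            String value = e.getValue();
            for (int i = 0; i < value.length(); ) {
                int cp = value.codePointAt(i);
                if (!isXml10Char(cp)) {
                    hits.add(String.format(Locale.ROOT, "%s[%d] = U+%04X",
                            e.getKey(), i, cp));
                }
                i += Character.charCount(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up order fields; the address carries a trailing null character.
        Map<String, String> order = new LinkedHashMap<>();
        order.put("name", "Ann");
        order.put("address", "1 Main\u0000");
        findInvalid(order).forEach(System.out::println);
    }
}
```

Running a check like this against values read straight from the database, and again against the values just before the XML is built, would show whether the characters are stored or introduced on retrieval, and which field is the culprit.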