I don’t think the file you attached is UTF8.  I have a number of tools I use to 
identify encoding, and it looks like ANSI to me.  I can save it as UTF8 and the 
diacritics don’t change on display, but the underlying code values most 
certainly do.

For example, in your CSV file, the following data:
New Riverside CafŽ
At a binary level, the diacritic is represented as: 0x8e (followed by 0x2c, 
the comma)

As a UTF8 file, that same diacritic would be represented as:
0xc5 0xbd
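
The byte values above can be reproduced with a short Python sketch. The byte 0x8e and the UTF8 pair 0xc5 0xbd come from this message; the choice of codecs is an assumption, since "ANSI" usually means Windows code page 1252, and the Mac Roman reading is only a guess at where the data may have originated.

```python
# Illustrative only: the raw bytes mirror the example in the message above.
raw = b"New Riverside Caf\x8e"

# Assumption: "ANSI" here means Windows code page 1252.
as_cp1252 = raw.decode("cp1252")
# Assumption: the file might instead have started life as Mac Roman.
as_mac_roman = raw.decode("mac_roman")

print(as_cp1252)     # last character is Z with caron, as displayed above
print(as_mac_roman)  # last character is e with acute accent

# The same displayed character takes two bytes once saved as UTF-8.
utf8_bytes = as_cp1252[-1].encode("utf-8")
print(utf8_bytes.hex())  # c5bd


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(raw))  # False: a lone 0x8e is not valid UTF-8
```

If the Mac Roman guess is right, the title would read "Café", which may be what the original cataloger typed.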

--tr

From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
[mailto:archivesspace_users_group-boun...@lyralists.lyrasis.org] On Behalf Of 
Lisa Calahan
Sent: Wednesday, February 15, 2017 4:26 PM
To: Archivesspace Users Group <archivesspace_users_group@lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Enumerations Findings

I've attached the .csv example. I didn't test it in 1.5.3, but the bug occurs 
in 1.5.2 (I know it did not occur in 1.5.1). I reported the bug on January 17.

On Wed, Feb 15, 2017 at 3:04 PM, Majewski, Steven Dennis (sdm7g) 
<sd...@eservices.virginia.edu> wrote:
Yes, and the previous cases I’ve seen (which have since been fixed) have been 
where the document was originally parsed with the correct character encoding, 
but that encoding wasn’t being preserved in some other internal transform (XML 
or JSON). So that might be something to look for if it’s still happening in a 
new use case.



On Feb 15, 2017, at 3:54 PM, Reese, Terry P. 
<reese.2...@osu.edu> wrote:

I’d be interested in the same thing (a sample file).  I’m familiar with the 
tools being used, and if the data is UTF8, then you shouldn’t see this problem 
unless the import is munging the data or encoding – which would be a much 
different problem.

--tr

From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
[mailto:archivesspace_users_group-boun...@lyralists.lyrasis.org] On Behalf Of 
Majewski, Steven Dennis (sdm7g)
Sent: Wednesday, February 15, 2017 3:50 PM
To: Archivesspace Users Group <archivesspace_users_group@lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Enumerations Findings


Do you have a sample import file that fails this way?
Do you know if it still fails on the current release?
(And has the bug been reported in Jira?)

— Steve.


On Feb 15, 2017, at 3:25 PM, Lisa Calahan 
<lcala...@umn.edu> wrote:

I've also received the same UTF8 error when importing legacy accession records 
that have valid diacritical marks in the title and/or agent name.

Lisa

On Wed, Feb 15, 2017 at 2:17 PM, Reese, Terry P. 
<reese.2...@osu.edu> wrote:
I guess my question would be: is your legacy data UTF8?  For whatever reason, 
I’ve found that historically, archives have often used other character sets when 
encoding their EAD files (though to be fair, I see this in MARC records as 
well: confusion between MARC8, ISO8859-1, and codepage 1252).  The simple 
solution (and this would maintain your characters) would be to convert the 
character set to UTF8.  Otherwise, even if you held on to these values, they 
wouldn’t display in any form that you could read; in fact, that is what the 
error message is trying to tell you: read as UTF8, your data is going to be 
gibberish, whether you keep it or not.
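
As a rough illustration of the conversion suggested above, here is a minimal Python sketch that re-encodes legacy bytes as UTF8. The function name `to_utf8` and the default source encoding of code page 1252 are assumptions made for the example; the real source encoding has to be identified first, because decoding with the wrong one silently produces the wrong characters.

```python
def to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode legacy bytes with a known encoding and re-encode as UTF-8.

    Hypothetical helper: source_encoding is an assumption and must match
    how the file was really written. Decoding is strict, so a byte with
    no mapping in source_encoding raises UnicodeDecodeError instead of
    being silently dropped.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


legacy = b"Caf\xe9"            # "Cafe" with e-acute as one byte in cp1252
converted = to_utf8(legacy)
print(converted)               # the e-acute is now two bytes: 0xc3 0xa9
print(converted.decode("utf-8"))
```

For a whole file, the same function could be applied to the result of reading the file in binary mode and the output written back in binary mode, so that no platform default encoding gets involved.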

--tr

From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
[mailto:archivesspace_users_group-boun...@lyralists.lyrasis.org] On Behalf Of 
Stasiulatis, Suzanne
Sent: Wednesday, February 15, 2017 3:12 PM
To: Archivesspace Users Group <archivesspace_users_group@lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Enumerations Findings


I totally agree that we shouldn’t have special characters if at all possible, 
but a large amount of our legacy data uses them. Especially in titles, staff 
want to use those characters as they are reflected on original materials.

Suzanne

From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
[mailto:archivesspace_users_group-boun...@lyralists.lyrasis.org] On Behalf Of 
Reese, Terry P.
Sent: Wednesday, February 15, 2017 2:58 PM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] Enumerations Findings

Why would you want to retain invalid special characters?  My guess is that one 
of the reasons for this error is that invalid characters would cause problems 
with indexing for search, as well as impact display and export.  I would think 
you’d want to use the error as a flag to identify data that needs to be 
corrected.  Or am I missing something?

--tr

From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
[mailto:archivesspace_users_group-boun...@lyralists.lyrasis.org] On Behalf Of 
Stasiulatis, Suzanne
Sent: Wednesday, February 15, 2017 2:52 PM
To: Archivesspace Users Group <archivesspace_users_group@lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Enumerations Findings

This also came up for me recently. If invalid special characters are present in 
the content titles, I get this error. I’m not sure quite how to adjust to 
accept those special characters.

<image002.png>

Suzanne Stasiulatis | Archivist II
Pennsylvania Historical and Museum Commission | Pennsylvania State Archives
350 North Street | Harrisburg, PA 17120-0090
Phone: 717-787-5953
http://www.phmc.pa.gov
sustasi...@pa.gov

From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
[mailto:archivesspace_users_group-boun...@lyralists.lyrasis.org] On Behalf Of 
Majewski, Steven Dennis (sdm7g)
Sent: Wednesday, February 15, 2017 2:36 PM
To: Archivesspace Users Group
Subject: Re: [Archivesspace_Users_Group] Enumerations Findings



We have run into the case that some EAD attribute values are required to be 
NMTOKENs, thus no embedded spaces or other disallowed characters. In 
enumeration values that had embedded spaces, we replaced the spaces with 
underscores.
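
A small Python sketch of that check and fix follows. The pattern is a simplified, ASCII-only approximation of the XML NMTOKEN rule (the real production also allows many non-ASCII name characters), and both function names are made up for this example.

```python
import re

# Assumption: ASCII-only subset of XML name characters
# (letters, digits, '.', '_', ':' and '-').
_NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:\-]+")


def is_nmtoken_ascii(value: str) -> bool:
    """True if value is non-empty and uses only the ASCII name characters."""
    return _NMTOKEN_ASCII.fullmatch(value) is not None


def spaces_to_underscores(value: str) -> str:
    """Trim the value and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", value.strip())


for candidate in ["oversized_box", "Oversized Box"]:
    fixed = spaces_to_underscores(candidate)
    print(candidate, is_nmtoken_ascii(candidate), fixed, is_nmtoken_ascii(fixed))
```

Values that contain other disallowed characters (slashes, parentheses and so on) would still fail the check after this replacement and would need their own handling.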

This has only come to my attention in the last week or so, so I haven’t made a 
thorough investigation of which attributes or which enumerations this applies 
to — just fixed them as I’ve encountered that error.

So it may be intentional that it is using the untranslated value.
(And I wouldn’t be surprised if, for simplicity, it is over-applying that 
rule in places where it’s not actually required.)


— Steve.


On Feb 15, 2017, at 2:09 PM, Carlos Lemus 
<carlos.le...@unlv.edu> wrote:

Hello,

At UNLV Special Collections, we've been working on cleaning up our enumeration 
values because in many cases there were duplicates caused by imports (e.g., 
value: linear_feet vs. value: Linear feet vs. value: Linear Feet). We wanted to 
stick as close as possible to ArchivesSpace standards and decided to make our 
enumeration values all lowercase, separated by underscores, and then merge any 
records with incorrect enumerations into the correct value (e.g., value: linear 
Feet into linear_feet). We also have some custom enumerations, such as value: 
oversized_box, translation: Oversized Box; value: digital_file, translation: 
Digital File.
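
The cleanup described above could be sketched in Python roughly as follows. This is only an illustration of the normalization rule (lowercase, underscores) and of spotting which imported values would merge; the function names are hypothetical and it does not touch ArchivesSpace itself.

```python
import re
from collections import defaultdict


def normalize_enum_value(value: str) -> str:
    """Lowercase the value and join its words with single underscores."""
    return re.sub(r"[\s_]+", "_", value.strip().lower())


def group_duplicates(values):
    """Map each normalized value to the raw values that collapse into it,
    keeping only groups where more than one raw value would be merged."""
    groups = defaultdict(list)
    for raw in values:
        groups[normalize_enum_value(raw)].append(raw)
    return {key: raws for key, raws in groups.items() if len(raws) > 1}


imported = ["linear_feet", "Linear feet", "Linear Feet", "oversized_box"]
print(group_duplicates(imported))
```

Each group lists the raw values that would be merged into the single normalized value used as the key.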

After we had that set up correctly, we had some findings and were wondering if 
anyone has experienced the same things or has a standard we could use.

1. When generating PDFs and EADs, the custom enumeration values (such as 
oversized_box) would come out in their machine-readable form, oversized_box, 
instead of using our local en.yml value (located in the local plugin).
     This was something I found in the EAD serializer 
(https://github.com/archivesspace/archivesspace/blob/master/backend/app/exporters/serializers/ead.rb#L490)
 and I was able to create a temporary solution that generates the label, but it 
required altering the enumeration instead of referencing our file. I thought I'd 
point it out because anyone creating custom enumerations, even with a 
translation in an en.yml file, would not see their change reflected in the EAD 
export. (I've attached an image reflecting this.) Has anyone experienced this?

2. Another example of this was the container "type" attribute. Before, 
something like Oversized Box would be exported to EAD as is because that was its 
value in the enumeration. After we changed the value to the correct 
oversized_box, it was exported to the EAD container "type" as is and carried 
over to the PDF as well. With some XSLT manipulation I was able to get it to 
show up as oversized box (shown in attachments). I've looked through 
https://www.loc.gov/ead/tglib/elements/container.html and cannot find an 
example of an attribute value with two or more words.
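
To make the behavior in both points concrete, here is a minimal Python sketch of looking up a display label with a fallback. It is not ArchivesSpace's serializer code; the dictionary stands in for the translations in a local en.yml, and the function name `display_label` is hypothetical.

```python
# Assumption: a plain dict standing in for translations loaded from en.yml.
LOCAL_TRANSLATIONS = {
    "oversized_box": "Oversized Box",
    "digital_file": "Digital File",
}


def display_label(value: str, translations: dict) -> str:
    """Return the translated label for an enumeration value.

    Falls back to the value with underscores turned into spaces when no
    translation exists, similar to the XSLT workaround described above.
    """
    if value in translations:
        return translations[value]
    return value.replace("_", " ")


print(display_label("oversized_box", LOCAL_TRANSLATIONS))  # Oversized Box
print(display_label("linear_feet", LOCAL_TRANSLATIONS))    # linear feet
```

The idea is that the attribute keeps the machine-readable value while anything shown to a reader goes through the lookup.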

Should attributes be machine readable (e.g., oversized_box), human readable 
(Oversized Box), or does it even matter? Of course, exporting it as Oversized 
Box would make it easiest to present a user-friendly version to the user.

Excuse the lengthy post; I'm trying to be thorough with my explanation, 
but please let me know if you've come across something similar or have a 
definitive solution.

Carlos Lemus
Application Programmer, Special Collections Technical Services
University Libraries, University of Nevada, Las Vegas

How often have I said to you that when you have eliminated the impossible, 
whatever remains, however improbable, must be the truth? - Sherlock Holmes
<enumeration_ead.PNG><containers_enum.PNG>
_______________________________________________
Archivesspace_Users_Group mailing list
Archivesspace_Users_Group@lyralists.lyrasis.org
http://lyralists.lyrasis.org/mailman/listinfo/archivesspace_users_group


--

Head of Archival Processing

University of Minnesota Libraries
Archives and Special Collections
Elmer L. Andersen Library, Suite 315
222-21st Ave. S.
Minneapolis MN 55455

Phone: 612.626.2531